<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: kination</title>
    <description>The latest articles on Forem by kination (@kination).</description>
    <link>https://forem.com/kination</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1495973%2Fcc25a858-4e28-44a0-a623-9e3d0ae5e328.png</url>
      <title>Forem: kination</title>
      <link>https://forem.com/kination</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kination"/>
    <language>en</language>
    <item>
      <title>Skill to make you slow, but to go fast</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sun, 22 Mar 2026 01:24:59 +0000</pubDate>
      <link>https://forem.com/kination/slow-skill-to-go-fast-26ih</link>
      <guid>https://forem.com/kination/slow-skill-to-go-fast-26ih</guid>
      <description>&lt;p&gt;Here's one skillsets I want to share.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kination" rel="noopener noreferrer"&gt;
        kination
      &lt;/a&gt; / &lt;a href="https://github.com/kination/slow-slow-quick-quick" rel="noopener noreferrer"&gt;
        slow-slow-quick-quick
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Skills for slo-mo, to go fast later.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;slow-slow-quick-quick&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;A collection of AI assistant skills built around intentional friction. It slows down interactions to deepen user's understanding, form muscle memory, and produce work that reflects the user's genuine thinking rather than AI-generated output.&lt;/p&gt;
&lt;p&gt;The name reflects the deliberate practice philosophy: &lt;strong&gt;go slow now, to go fast later.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Skills&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slow-vibe-coding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Implementing a feature, writing a function, solving a coding problem&lt;/td&gt;
&lt;td&gt;Refuses to write implementation code. Brainstorms patterns, presents at least two concrete approaches with trade-offs, provides comment-only skeletons, and guides you to write it yourself.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slow-vibe-sw-architect&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System design, architecture decisions, technical choices&lt;/td&gt;
&lt;td&gt;Refuses to give an immediate answer. Asks trade-off questions and failure scenarios until you arrive at a decision you own. Produces an ADR.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slow-dev-research&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Technology selection, stack comparisons, engineering trade-off questions&lt;/td&gt;
&lt;td&gt;Refuses to give a recommendation upfront. Surfaces research and real-world cases with sources, asks requirements questions one at a time, and&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kination/slow-slow-quick-quick" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;LLMs change how fast software is being made. Tasks that used to take days now finishes in minutes. This speed creates a specific problem for the engineers who's in charge. It is remarkably easy to become a spectator in your own project.&lt;/p&gt;

&lt;p&gt;I have felt this disconnection during recent work. When AI implements a feature, the growth of codebase is so fast, that sometimes human brain can barely keep up with. You are directing the work, but implementation details can bypass your understanding. You will feel that the work end up with a gap between what the system does and what you actually know about its inner workings.&lt;/p&gt;

&lt;p&gt;Don't worry. It's not your fault, and it does not mean you are incompetent engineer. Strictly, in terms of the "speed" of software creation and problem solving, no human can even go close to current AI’s . That is how far AI has come today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make slowly, to go fast later.
&lt;/h2&gt;

&lt;p&gt;True productivity follows a rhythm of "Slow, Slow, Quick, Quick." &lt;/p&gt;

&lt;p&gt;This means alternating 'time to think' with implementation speed performance. Slowing down to analyze a design choice is not a waste of time. It is necessary investment. Building a clear mental model during the "slow" phases ensures the "quick" phases stay grounded. Without this balance, you just make shallow decisions that lead to huge technical debt.&lt;/p&gt;

&lt;p&gt;This is the motivation for this skill.&lt;/p&gt;

&lt;p&gt;Here's how to start for &lt;code&gt;claude-code&lt;/code&gt; users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/plugin marketplace add kination/slow-slow-quick-quick
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/plugin &lt;span class="nb"&gt;install &lt;/span&gt;slow-engineering@slow-slow-quick-quick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(will make for other LLMs soon...)&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Implementation with &lt;code&gt;slow-vibe-coding&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;slow-vibe-coding&lt;/code&gt; skill is one way to stay involved on coding. Typical assistants dump blocks of code, but this skill follows a strict protocol. AI is prohibited from writing implementation logic. It guides you through the problem and the patterns, then it stops. &lt;/p&gt;

&lt;p&gt;The AI generates a skeleton of comments and a "pass" statement. Though they guide with well-made comments or making base code by your description, you will have change to involve in every output.&lt;/p&gt;

&lt;p&gt;This builds muscle memory and maintains ownership. By the end, you have to explain details in review process. The AI becomes a high level director, not a code generator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Integrity with &lt;code&gt;slow-vibe-sw-architect&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Design is usually the first thing lost when you move too fast. The &lt;code&gt;slow-vibe-sw-architect&lt;/code&gt; skill enforces a pre-implementation step. It prohibits recommendations until you evaluate three tradeoffs and two failure scenarios. This forced delay explores scale, consistency, and complexity. &lt;/p&gt;

&lt;p&gt;The result is a co-authored "Architecture Decision Record" (ADR). Documenting the "why" keeps the system coherent. It prevents AI from building something no human can debug.&lt;/p&gt;

&lt;p&gt;So, here's the summary of these 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4b0ox0pjhxbqwxwzikg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4b0ox0pjhxbqwxwzikg.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  And one more, &lt;code&gt;slow-dev-research&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Technological decision often get made on familiarity, not fit. Result of selecting 'message queue' or 'database' under time pressure usually heads to choosing what team already knows. The research phase gets skipped because it feels like overhead. This is where a lot of technical debt originates.                                         &lt;/p&gt;

&lt;p&gt;Skill "slow-dev-research" adds friction before choices get made. The AI collects benchmarks, cases from real world, and known trade-offs with sources. After that it asks about your specific constraints one at a time. It can be expected traffic, team experience, consistency requirements, cost model. It refuses to give a recommendation until you have surfaced your own requirements. &lt;/p&gt;

&lt;p&gt;The output is a research document with a trade-off map you fill out and a conclusion section you write yourself. It feeds directly into slow-vibe-sw-architect when a design decision follows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These are the starting items of "slow" skills. I am planning more to cover the full lifecycle 'slowly', from research to verification.&lt;/p&gt;

&lt;p&gt;Using LLMs is inevitable. I love coding, and still believe in the competitive edge of human-made code. But even so, AI is simply too fast and too productive to ignore.&lt;/p&gt;

&lt;p&gt;But long term productivity requires more than raw generation. Working slow and deep at critical points will improve your understanding, to move quick later. Catching up may be impossible, but falling behind is not an option to remain as an engineer. Our mastery of the architecture must grow alongside the code it produces.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>skill</category>
      <category>development</category>
    </item>
    <item>
      <title>build-my-own-datalake: Improve metadata with caching</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sat, 28 Feb 2026 06:37:35 +0000</pubDate>
      <link>https://forem.com/kination/build-my-own-datalake-improve-metadata-with-caching-2lbl</link>
      <guid>https://forem.com/kination/build-my-own-datalake-improve-metadata-with-caching-2lbl</guid>
      <description>&lt;p&gt;This is the next story of following post: &lt;a href="https://dev.to/kination/build-my-own-datalake-part-1-367h"&gt;https://dev.to/kination/build-my-own-datalake-part-1-367h&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Building a High-Performance Metadata System with Global Caching
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Caching schema metadata at the JNI boundary to eliminate per-write filesystem reads&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've made a goal for project &lt;a href="https://github.com/kination/vine" rel="noopener noreferrer"&gt;vine&lt;/a&gt; as write-optimized data lake format. And what I didn't expect was that, reading a small JSON file would be the bottleneck.&lt;/p&gt;

&lt;p&gt;My initial implementation was spending a significant chunk of each write just loading a schema definition from disk—repeatedly, on every operation.&lt;/p&gt;

&lt;p&gt;For a system handling thousands of writes per second, this compounds fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Initial Implementation
&lt;/h2&gt;

&lt;p&gt;Like many data lake formats, Vine uses a metadata file to define table schemas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both read and write paths load this file on every call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// filesystem hit every time&lt;/span&gt;
    &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Into&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;into&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_user_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;UserEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// called on EVERY write&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;chrono&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Utc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%Y-%m-%d"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;output_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"data_{}.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;chrono&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Utc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%H%M%S_%f"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;parquet_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;metadata_to_parquet_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;file&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;SerializedFileWriter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;File&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;parquet_schema&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;event_to_parquet_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The naive approach which I've decided first was just reading &lt;code&gt;vine_meta.json&lt;/code&gt; on every write operation. It was doing disk reads for data that never changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  It's just a small JSON file. Why does it matter?
&lt;/h2&gt;

&lt;p&gt;That's a natural first reaction. This metadata JSON file is maybe 1-2KB. OS page cache should keep it hot. Why would this make it slow?&lt;/p&gt;

&lt;p&gt;It turns out the file size is almost irrelevant. What's expensive is everything &lt;em&gt;around&lt;/em&gt; the read.&lt;/p&gt;

&lt;h3&gt;
  
  
  Syscall overhead
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;fs::read_to_string()&lt;/code&gt; call goes through at least &lt;code&gt;open()&lt;/code&gt;, &lt;code&gt;read()&lt;/code&gt;, and &lt;code&gt;close()&lt;/code&gt;—three syscalls minimum, each of which requires a user-to-kernel context switch. System calls have fixed overhead regardless of how much data they transfer. A study measuring &lt;a href="https://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead/" rel="noopener noreferrer"&gt;Linux syscall overhead&lt;/a&gt; puts the baseline at roughly 1-4 microseconds per call on modern hardware, before any I/O happens.&lt;/p&gt;

&lt;p&gt;At 1000 writes/second, that's continuous context-switching pressure. A small file just means you're paying that fixed overhead without getting much data for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Page cache doesn't eliminate overhead
&lt;/h3&gt;

&lt;p&gt;Even if OS caches file contents, you still pay for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VFS path resolution (walking the directory tree)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dentry&lt;/code&gt; and &lt;code&gt;inode&lt;/code&gt; lookups in the kernel&lt;/li&gt;
&lt;li&gt;Lock acquisition on the file&lt;/li&gt;
&lt;li&gt;JSON deserialization (on every call, even on a page-cache hit)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/factors-affecting-i-o-and-file-system-performance_monitoring-and-managing-system-status-and-performance" rel="noopener noreferrer"&gt;Red Hat documentation on I/O performance factors&lt;/a&gt; notes that small files generate disproportionate overhead relative to data transferred, because fixed costs of the filesystem stack dominate. Research from CMU's Parallel Data Lab on &lt;a href="https://www.pdl.cmu.edu/PDL-FTP/FS/CMU-PDL-14-103.pdf" rel="noopener noreferrer"&gt;scaling file system metadata performance&lt;/a&gt; shows that dcache and inode lookup latency is often the limiting factor for metadata-heavy workloads, not disk bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Industry validation: this is a known problem
&lt;/h3&gt;

&lt;p&gt;This isn't unique to Vine. Major data systems have all hit the same wall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; was explicitly designed to avoid Hive Metastore's pattern of fetching partition locations (a metadata call) followed by file system lookups (a storage call) on every query. The &lt;a href="https://www.chaosgenius.io/blog/iceberg-vs-delta-lake-metadata-indexing/" rel="noopener noreferrer"&gt;Iceberg vs Delta Lake metadata comparison&lt;/a&gt; explains how Iceberg's manifest files cache partition-to-file mappings to avoid this double-lookup pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt; added &lt;a href="https://cloud.google.com/bigquery/docs/metadata-caching" rel="noopener noreferrer"&gt;metadata caching for external tables&lt;/a&gt; specifically because "listing millions of files from external data sources can take several minutes" without it. Their documentation notes that caching metadata avoids repeated round-trips to external storage on every query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/products/storage/files" rel="noopener noreferrer"&gt;Microsoft Azure Files&lt;/a&gt; &lt;a href="https://techcommunity.microsoft.com/blog/azurestorageblog/accelerate-metadata-heavy-workloads-with-metadata-caching-preview-for-azure-prem/4261442" rel="noopener noreferrer"&gt;added metadata caching&lt;/a&gt; for SMB workloads and observed up to 55% latency reduction and 2-3x more consistent response times for workloads with frequent metadata access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same pattern shows up everywhere: metadata is small, but accessing it repeatedly at high frequency creates real overhead. Solution is always the same: cache in memory and skip filesystem entirely on the hot path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution: three-tier caching
&lt;/h2&gt;

&lt;p&gt;I ended up with three caching layers. Here's the description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xaazictnmul41977fdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xaazictnmul41977fdp.png" alt="3-tier-caching" width="620" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1: Global In-Memory Cache&lt;/strong&gt; — &lt;code&gt;lazy_static&lt;/code&gt; + &lt;code&gt;Mutex&amp;lt;HashMap&amp;gt;&lt;/code&gt;, shared across all JNI calls for the lifetime of the process. Handles the vast majority of operations. Schemas rarely change, so caching them costs almost nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2: Local Disk Cache&lt;/strong&gt; — &lt;code&gt;_meta/schema.json&lt;/code&gt; per table, updated asynchronously. Covers cold-start recovery. When the process restarts, I skip re-reading the original metadata file and go straight to disk cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3: Vortex File Inference&lt;/strong&gt; — Read schema directly from data file headers, merging if multiple versions exist. The fallback that always works. Even without &lt;code&gt;vine_meta.json&lt;/code&gt;, I can infer schema from Vortex data files.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementation: global cache layer
&lt;/h2&gt;

&lt;p&gt;At JNI boundary between Spark and Rust, I can share a single cache across all operations. That's what makes the difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;lazy_static!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;WRITER_CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WriterCache&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_writer_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WRITER_CACHE&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="py"&gt;.metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;  &lt;span class="c1"&gt;// cache hit - no filesystem access&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;WriterCache&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt;&lt;span class="py"&gt;.metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="nf"&gt;.insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting: &lt;code&gt;lazy_static!&lt;/code&gt; creates the cache exactly once; &lt;code&gt;Mutex&amp;lt;HashMap&amp;gt;&lt;/code&gt; is simpler than &lt;code&gt;RwLock&lt;/code&gt; and sufficient for my access patterns; metadata is small (~1KB), so cloning is cheaper than reference counting overhead.&lt;/p&gt;

&lt;p&gt;The writer path then becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_writer_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// in-memory lookup&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.config.validate_every_write&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;validate_schema_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WriterConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enable_metadata_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: true&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_metadata_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: true (strict mode)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;validate_every_write&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Default: false (for performance)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I validate once on the first write, then trust cache after that. For streaming workloads where schemas are stable, per-write validation isn't worth the cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation: schema-on-read fallback
&lt;/h2&gt;

&lt;p&gt;For readers, I need more flexibility. A missing metadata file shouldn't crash the read path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 1: vine_meta.json (explicit schema - fastest)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt;&lt;span class="nf"&gt;.exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"_meta/schema.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 2: Cached schema (disk cache - fast)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="nf"&gt;.exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 3: Infer from Vortex files (slowest but always works)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;infer_from_vortex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Cache asynchronously for next read&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache_path_clone&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_clone&lt;/span&gt;&lt;span class="nf"&gt;.save_to_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cache_path_clone&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That async cache write mattered more than expected. Waiting for disk write was adding unnecessary latency to first read. A background thread handles it while I return results immediately—and if it fails, I just re-infer on the next miss.&lt;/p&gt;

&lt;p&gt;Schema inference reads only file headers—not full data rows—and stops early once a complete schema is found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;infer_from_vortex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;AsRef&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;all_schemas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_date_directories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vortex_file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_vortex_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;vortex_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;all_schemas&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dtype_to_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inferred_table"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;merge_schemas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_schemas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// union of all fields seen&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Implementation: handling schema mismatches
&lt;/h2&gt;

&lt;p&gt;Readers must handle the case where the &lt;strong&gt;expected schema&lt;/strong&gt; (from metadata) doesn't match the &lt;strong&gt;actual schema&lt;/strong&gt; (from data files). Instead of failing, I use lenient matching: iterate over expected fields, extract value if present, fill with a type-appropriate default if not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;array_to_csv_rows_lenient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;StructArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_idx&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_schema&lt;/span&gt;&lt;span class="py"&gt;.fields&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="nf"&gt;.field_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;extract_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_idx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nb"&gt;None&lt;/span&gt;      &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;default_value_for_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="py"&gt;.data_type&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Old readers skip unknown fields; new readers fill missing ones with type-appropriate defaults. Both directions work without coordination.&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational impact
&lt;/h2&gt;

&lt;p&gt;The fallback chain has some useful operational properties worth calling out.&lt;/p&gt;

&lt;p&gt;Schema changes don't require coordination: writer creates new files with an updated schema, readers pick up new fields automatically via inference, no registry to update, no downtime. If &lt;code&gt;vine_meta.json&lt;/code&gt; is accidentally deleted, reads still work (fallback to inference) while writes fail fast—and the metadata can be rebuilt from data files with &lt;code&gt;vine schema rebuild &amp;lt;table_path&amp;gt;&lt;/code&gt;. That's a better failure mode than formats that require metadata for reads (e.g., Delta Lake transaction log).&lt;/p&gt;

&lt;p&gt;Validation overhead also stays bounded: schema is validated once on &lt;code&gt;Writer::new()&lt;/code&gt;, then cached. All subsequent writes in that writer's lifetime skip validation entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I've learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cache at the right granularity
&lt;/h3&gt;

&lt;p&gt;I first tried per-writer instance caching. It helped within a single writer but not enough: Spark creates many short-lived writers (one per partition), and each paid the full cold-start cost.&lt;/p&gt;

&lt;p&gt;Moving cache to global scope meant it survived the writer lifecycle. Speedup came from paying I/O cost once and spreading it across all subsequent operations, not just within a single instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Optimize for the common case
&lt;/h3&gt;

&lt;p&gt;Validating schema on every write adds overhead per operation. Validating once adds that overhead once. Since schemas in streaming workloads almost never change between writes, right default is: validate once, trust cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Separate write and read semantics
&lt;/h3&gt;

&lt;p&gt;Writers are strict: they require &lt;code&gt;vine_meta.json&lt;/code&gt; at table creation, validate on the first write, and fail fast on mismatches. Readers are lenient: they try the metadata file first, fall back to the disk cache, fall back to file inference, and handle missing fields gracefully. The same semantics don't work for both access patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Use file formats as schema sources
&lt;/h3&gt;

&lt;p&gt;Vortex files (like Parquet) are self-describing—schema lives in file headers. That means no external schema registry to operate, and you can't lose schema without losing data. Files themselves are source of truth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison: why global cache matters
&lt;/h2&gt;

&lt;p&gt;Two approaches side by side:&lt;/p&gt;

&lt;h2&gt;
  
  
  No caching will make filesystem access on every write:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// hit disk EVERY TIME&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Global caching makes in-memory lookup, shared across all tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_writer_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// HashMap lookup&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache survives the writer instance lifecycle. First write across any writer pays I/O cost once; everything after is a HashMap lookup.&lt;/p&gt;

&lt;p&gt;Per-instance caching (caching inside a &lt;code&gt;Writer&lt;/code&gt; struct) is the obvious middle ground, but Spark creates &lt;strong&gt;many short-lived writers&lt;/strong&gt; (one per partition)—each new instance would still pay the cold-start cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I gave up
&lt;/h3&gt;

&lt;p&gt;Strong schema enforcement: I validate once, not on every write. Schema errors may not surface immediately. There's a configurable &lt;code&gt;validate_every_write&lt;/code&gt; flag for stricter environments.&lt;/p&gt;

&lt;p&gt;Instant schema change detection: writers may briefly use a stale cached schema after a metadata update. File-watching cache invalidation is planned.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I gained
&lt;/h3&gt;

&lt;p&gt;Metadata lookup goes from a filesystem operation to a HashMap lookup on the hot path. No external schema registry means one fewer system to operate. Reads still work if the metadata file goes missing, fallback chains rebuild automatically, and there's no single point of failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;A few things I'm planning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache invalidation: Watch &lt;code&gt;vine_meta.json&lt;/code&gt; with &lt;code&gt;notify&lt;/code&gt; and invalidate the global cache on file change. Right now a schema update requires a process restart to take effect.&lt;/li&gt;
&lt;li&gt;TTL-based expiration: For long-running processes, stale cache is a real risk. A configurable TTL (default: 1 hour) with background refresh should cover most cases.&lt;/li&gt;
&lt;li&gt;Cache statistics: Hit rate and miss count exposed as metrics for monitoring. Cache effectiveness is currently invisible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope to have a change to describe about these.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Underlying issue was that metadata doesn't change between writes. Every disk load was wasted I/O—not because the file was large, but because syscall overhead and deserialization cost compound at high write rates. Cache globally, validate once, and let file format carry schema when metadata file isn't there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code repository
&lt;/h2&gt;

&lt;p&gt;Full implementation: &lt;a href="https://github.com/kination/vine" rel="noopener noreferrer"&gt;Vine GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/global_cache.rs&lt;/code&gt; — global cache implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/reader_cache.rs&lt;/code&gt; — schema-on-read fallback chain&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/metadata.rs&lt;/code&gt; — metadata inference from Vortex files&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead/" rel="noopener noreferrer"&gt;Measurements of system call performance and overhead&lt;/a&gt; — Linux syscall latency benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/factors-affecting-i-o-and-file-system-performance_monitoring-and-managing-system-status-and-performance" rel="noopener noreferrer"&gt;Factors affecting I/O and file system performance&lt;/a&gt; — Red Hat Enterprise Linux documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.pdl.cmu.edu/PDL-FTP/FS/CMU-PDL-14-103.pdf" rel="noopener noreferrer"&gt;Scaling file system metadata performance&lt;/a&gt; — CMU Parallel Data Lab research paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.chaosgenius.io/blog/iceberg-vs-delta-lake-metadata-indexing/" rel="noopener noreferrer"&gt;Iceberg vs Delta Lake metadata indexing&lt;/a&gt; — Metadata comparison between Apache Iceberg and Delta Lake&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/metadata-caching" rel="noopener noreferrer"&gt;Metadata caching for BigQuery external tables&lt;/a&gt; — Google Cloud documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcommunity.microsoft.com/blog/azurestorageblog/accelerate-metadata-heavy-workloads-with-metadata-caching-preview-for-azure-prem/4261442" rel="noopener noreferrer"&gt;Accelerate metadata-heavy workloads with metadata caching&lt;/a&gt; — Microsoft Azure Storage Blog
This is next chapter of following
&lt;a href="https://dev.to/kination/build-my-own-datalake-part-1-367h"&gt;https://dev.to/kination/build-my-own-datalake-part-1-367h&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem: Metadata is the Hidden Bottleneck
&lt;/h2&gt;

&lt;p&gt;In high-throughput streaming pipelines, every millisecond counts. We discovered that our initial implementation of Vine (a write-optimized data lake format) was spending &lt;strong&gt;300ms per write&lt;/strong&gt; just reading a tiny JSON metadata file.&lt;/p&gt;

&lt;p&gt;For a system designed to handle &lt;strong&gt;1000+ writes per second&lt;/strong&gt;, this was unacceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Initial Implementation
&lt;/h3&gt;

&lt;p&gt;Like many data lake formats, Vine uses a metadata file to define table schemas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how this metadata drives Parquet read/write operations in a data lake pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="n"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;file&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;FileReader&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;file&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;FileWriter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 1. Load metadata from vine_meta.json&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Read Parquet files using the schema from metadata&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;read_user_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;UserEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 80ms - loads schema&lt;/span&gt;

    &lt;span class="c1"&gt;// Scan date-partitioned directories (2024-01-24/, 2024-01-25/, ...)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;all_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_date_directories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;parquet_file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_parquet_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Read Parquet file with schema validation&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;File&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parquet_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;file&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;SerializedFileReader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Validate schema matches metadata&lt;/span&gt;
            &lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.metadata&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.file_metadata&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.schema&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Read rows&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_group&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.get_row_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="n"&gt;all_events&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Write Parquet files with date partitioning (data lake pattern)&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_user_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;UserEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 80ms - loads schema&lt;/span&gt;

    &lt;span class="c1"&gt;// Create date-partitioned output path&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;chrono&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Utc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%Y-%m-%d"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Generate filename with microsecond precision&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;chrono&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Utc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%H%M%S_%f"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;output_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data_{}.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// Convert metadata to Parquet schema&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;parquet_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;metadata_to_parquet_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Write Parquet file&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;File&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;file&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;SerializedFileWriter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;parquet_schema&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Write rows using schema from metadata&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;event_to_parquet_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The naive approach:&lt;/strong&gt; Read &lt;code&gt;vine_meta.json&lt;/code&gt; on every write operation.&lt;/p&gt;

&lt;p&gt;This was a classic case of premature I/O. We were doing disk reads for data that &lt;strong&gt;never changed&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: Three-Tier Caching Strategy
&lt;/h2&gt;

&lt;p&gt;We implemented an aggressive caching strategy that caches metadata at three levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│ Layer 1: Global In-Memory Cache        │
│  - lazy_static + Mutex&amp;lt;HashMap&amp;gt;         │
│  - Shared across ALL JNI calls          │
│  - Lifetime: Process lifetime           │
│  - Lookup time: 0.88ms                  │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 2: Local Disk Cache              │
│  - _meta/schema.json per table          │
│  - Updated asynchronously               │
│  - Lookup time: 10-20ms                 │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 3: Vortex File Inference         │
│  - Read schema from data files          │
│  - Merge if multiple versions           │
│  - Lookup time: 50-80ms                 │
│  - Always works (schema-on-read)        │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Three Layers?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 (Global Cache):&lt;/strong&gt; The fast path for 99.9% of operations. Since schemas rarely change, we cache them in memory for the lifetime of the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 (Disk Cache):&lt;/strong&gt; Enables fast cold-start recovery. When the process restarts, we don't need to re-read the original metadata file or infer from data files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 (File Inference):&lt;/strong&gt; The ultimate fallback. Even if &lt;code&gt;vine_meta.json&lt;/code&gt; is missing or corrupted, we can always infer the schema from Vortex data files themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation: Global Cache Layer
&lt;/h2&gt;

&lt;p&gt;The key breakthrough was realizing that in the JNI boundary between Spark and Rust, we could share a &lt;strong&gt;global cache&lt;/strong&gt; across all operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Cache Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;lazy_static&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;lazy_static&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;lazy_static!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;READER_CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ReaderCache&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

    &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;WRITER_CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WriterCache&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_writer_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WRITER_CACHE&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Fast path: Check global cache first&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="py"&gt;.metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;  &lt;span class="c1"&gt;// 0.88ms - cache hit!&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Slow path: Load from disk and cache&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;WriterCache&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt;&lt;span class="py"&gt;.metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="nf"&gt;.insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;writer_cache&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lazy initialization:&lt;/strong&gt; Use &lt;code&gt;lazy_static!&lt;/code&gt; to create a global cache that's initialized once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained locking:&lt;/strong&gt; Use &lt;code&gt;Mutex&amp;lt;HashMap&amp;gt;&lt;/code&gt; instead of &lt;code&gt;RwLock&lt;/code&gt; (simpler, sufficient for our access patterns)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone is cheap:&lt;/strong&gt; Metadata is small (~1KB), cloning is faster than reference counting overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Writer Path with Global Cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Fast path: Use cached metadata (97x faster!)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cached_metadata&lt;/span&gt;&lt;span class="nf"&gt;.as_ref&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.ok_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Metadata cache not initialized"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Optional: Validate schema (disabled by default for performance)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.config.validate_every_write&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;validate_schema_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Write to Vortex with schema&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration options:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WriterConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enable_metadata_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: true&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_metadata_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: true (strict mode)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;validate_every_write&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Default: false (for performance)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, we &lt;strong&gt;validate once&lt;/strong&gt; (on cache miss) and then &lt;strong&gt;trust the cache&lt;/strong&gt; for all subsequent writes. This trades off strict validation for performance—a worthwhile trade for streaming workloads where schemas are stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation: Schema-on-Read Fallback
&lt;/h2&gt;

&lt;p&gt;For readers, we need more flexibility. A missing metadata file shouldn't crash the read path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fallback Chain Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 1: vine_meta.json (explicit schema - fastest)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;meta_path&lt;/span&gt;&lt;span class="nf"&gt;.exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Load from JSON (~10ms)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"_meta/schema.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 2: Cached schema (disk cache - fast)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="nf"&gt;.exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Strategy 3: Infer from Vortex files (slowest but always works)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;infer_from_vortex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Cache asynchronously for next read&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata_clone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cache_path_clone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache_path&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_clone&lt;/span&gt;&lt;span class="nf"&gt;.save_to_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cache_path_clone&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why asynchronous caching?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We discovered that waiting for the cache write (5-10ms) was adding unnecessary latency to the first read. By spawning a background thread, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Return results to the user immediately&lt;/li&gt;
&lt;li&gt;Cache the schema for the next operation&lt;/li&gt;
&lt;li&gt;Avoid blocking on disk I/O&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is safe because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache writes are idempotent&lt;/li&gt;
&lt;li&gt;Cache is only a performance optimization, not required for correctness&lt;/li&gt;
&lt;li&gt;Worst case: We re-infer schema on next cache miss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Schema Inference from Vortex Files
&lt;/h3&gt;

&lt;p&gt;When all else fails, we can &lt;strong&gt;always&lt;/strong&gt; read the schema directly from Vortex data files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;infer_from_vortex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;AsRef&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;all_schemas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Scan date-partitioned directories (YYYY-MM-DD/)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;date_dir&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_date_directories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vortex_file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_vortex_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;date_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Read Vortex file header (cheap - no full scan needed)&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;vortex_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dtype_to_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inferred_table"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;all_schemas&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Merge all schemas (union of fields)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;merge_schemas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_schemas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading Vortex headers: &lt;strong&gt;~1-2ms per file&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Schema merging: &lt;strong&gt;~5ms for 100 files&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;50-80ms&lt;/strong&gt; (still faster than cold disk reads on many filesystems)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; We only scan until we find a complete schema. If the first file has all expected fields, we stop early.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation: Handling Schema Mismatches
&lt;/h2&gt;

&lt;p&gt;Readers must handle the case where the &lt;strong&gt;expected schema&lt;/strong&gt; (from metadata) doesn't match the &lt;strong&gt;actual schema&lt;/strong&gt; (from data files).&lt;/p&gt;

&lt;h3&gt;
  
  
  Lenient Schema Matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;array_to_csv_rows_lenient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;StructArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;actual_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="nf"&gt;.field_names&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;expected_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expected_schema&lt;/span&gt;&lt;span class="py"&gt;.fields&lt;/span&gt;
        &lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="nf"&gt;.as_str&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_idx&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;expected_field&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;expected_schema&lt;/span&gt;&lt;span class="py"&gt;.fields&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="nf"&gt;.field_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;expected_field&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// Field exists in data: Extract value&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// Field missing: Use default/null&lt;/span&gt;
                &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;default_value_for_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;expected_field&lt;/span&gt;&lt;span class="py"&gt;.data_type&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;default_value_for_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"integer"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"long"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"byte"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"short"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"float"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"double"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"0.0"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"boolean"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"string"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;// Empty string for missing text&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward compatibility:&lt;/strong&gt; Old readers can read new data (ignore unknown fields)&lt;br&gt;
&lt;strong&gt;Forward compatibility:&lt;/strong&gt; New readers can read old data (fill missing fields with defaults)&lt;/p&gt;


&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Benchmark Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; 1M rows, 4 columns (id: int, name: string, age: int, score: double)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; M1 Mac, 16GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement:&lt;/strong&gt; 100 repeated operations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Write Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Time (1M rows)&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No cache&lt;/strong&gt; (baseline)&lt;/td&gt;
&lt;td&gt;12.0s&lt;/td&gt;
&lt;td&gt;83K rows/sec&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;With global cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;555K rows/sec&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.7x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Breakdown of 12.0s baseline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON parsing: 8.5s (71%)&lt;/li&gt;
&lt;li&gt;CSV conversion: 2.0s (17%)&lt;/li&gt;
&lt;li&gt;Vortex write: 1.5s (12%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Breakdown of 1.8s cached:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metadata lookup: 0.088s (5%)&lt;/li&gt;
&lt;li&gt;CSV conversion: 0.7s (39%)&lt;/li&gt;
&lt;li&gt;Vortex write: 1.0s (56%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cache eliminated &lt;strong&gt;8.4 seconds&lt;/strong&gt; of pure JSON parsing overhead!&lt;/p&gt;
&lt;h3&gt;
  
  
  Read Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Time (100 calls)&lt;/th&gt;
&lt;th&gt;Per-call&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No cache&lt;/strong&gt; (baseline)&lt;/td&gt;
&lt;td&gt;8000ms&lt;/td&gt;
&lt;td&gt;80ms/call&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;With global cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;88ms&lt;/td&gt;
&lt;td&gt;0.88ms/call&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Per-call breakdown (no cache):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File open: 15ms&lt;/li&gt;
&lt;li&gt;JSON parse: 50ms&lt;/li&gt;
&lt;li&gt;Metadata object creation: 15ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Per-call breakdown (cached):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HashMap lookup: 0.3ms&lt;/li&gt;
&lt;li&gt;Clone: 0.58ms&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Memory Overhead
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1KB per table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1MB for 1000 tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory amplification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Negligible (&amp;lt;0.1% of heap)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cache is essentially free in terms of memory.&lt;/p&gt;


&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Cache at the Right Granularity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial attempt:&lt;/strong&gt; Per-writer instance caching&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Helped: Reduced redundant reads within a single writer&lt;/li&gt;
&lt;li&gt;Didn't help: JNI overhead still present for each writer creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Breakthrough:&lt;/strong&gt; Global cache shared across all JNI calls&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: &lt;strong&gt;97x speedup&lt;/strong&gt; because cache survives writer lifecycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; In high-throughput systems, &lt;strong&gt;amortize I/O across all operations&lt;/strong&gt;, not just within a single instance.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Optimize for the Common Case
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; 99.9% of writes use the same schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial design:&lt;/strong&gt; Validate schema on every write&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost: 5-10ms per write&lt;/li&gt;
&lt;li&gt;Benefit: Catch schema errors immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimized design:&lt;/strong&gt; Validate once, trust cache&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost: 5-10ms once (on cache miss)&lt;/li&gt;
&lt;li&gt;Benefit: 0ms for all subsequent writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; &lt;strong&gt;Assume schemas are stable&lt;/strong&gt;, handle evolution as the exception.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Separate Write and Read Semantics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Writers (strict mode):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require &lt;code&gt;vine_meta.json&lt;/code&gt; at table creation&lt;/li&gt;
&lt;li&gt;Validate schema on first write&lt;/li&gt;
&lt;li&gt;Fail fast on mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Readers (lenient mode):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try &lt;code&gt;vine_meta.json&lt;/code&gt; first&lt;/li&gt;
&lt;li&gt;Fallback to cache&lt;/li&gt;
&lt;li&gt;Fallback to file inference&lt;/li&gt;
&lt;li&gt;Handle missing fields gracefully&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; &lt;strong&gt;Different access patterns need different guarantees&lt;/strong&gt;. Don't force one-size-fits-all.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Use File Formats as Schema Sources
&lt;/h3&gt;

&lt;p&gt;Vortex (like Parquet) files are &lt;strong&gt;self-describing&lt;/strong&gt;. The schema is embedded in the file header.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication:&lt;/strong&gt; We don't need an external schema registry. The data files themselves are the source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resilience (can't lose schema if you have the data)&lt;/li&gt;
&lt;li&gt;Simplicity (one less system to operate)&lt;/li&gt;
&lt;li&gt;Performance (schema is co-located with data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; &lt;strong&gt;Modern columnar formats are self-describing&lt;/strong&gt;. Trust the format, don't duplicate metadata.&lt;/p&gt;


&lt;h2&gt;
  
  
  Operational Impact
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Zero-Downtime Schema Changes
&lt;/h3&gt;

&lt;p&gt;Because readers use a fallback chain, we can update schemas without coordination:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writer creates new files with updated schema&lt;/li&gt;
&lt;li&gt;Readers detect new fields automatically via inference&lt;/li&gt;
&lt;li&gt;No global schema registry to update&lt;/li&gt;
&lt;li&gt;No downtime required&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Resilient to Metadata Loss
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;vine_meta.json&lt;/code&gt; is accidentally deleted:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads still work (fallback to inference)&lt;/li&gt;
&lt;li&gt;Writes fail (strict mode requirement)&lt;/li&gt;
&lt;li&gt;Rebuild metadata from data files: &lt;code&gt;vine schema rebuild &amp;lt;table_path&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is much better than formats that &lt;strong&gt;require&lt;/strong&gt; metadata for reads (e.g., Delta Lake transaction log).&lt;/p&gt;
&lt;h3&gt;
  
  
  Write-Optimized Validation
&lt;/h3&gt;

&lt;p&gt;Schema validation happens &lt;strong&gt;once per writer instance&lt;/strong&gt;, not once per write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First write: Validate schema&lt;/span&gt;
&lt;span class="nn"&gt;Writer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Loads&lt;/span&gt; &lt;span class="nf"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;validates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caches&lt;/span&gt;

&lt;span class="c1"&gt;// Subsequent writes: Trust cache&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Uses&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="nf"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Validation overhead:&lt;/strong&gt; &amp;lt;1% of total write time (only on first write).&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison: Why Global Cache Matters
&lt;/h2&gt;

&lt;p&gt;Let's compare different caching strategies:&lt;/p&gt;

&lt;h3&gt;
  
  
  No Caching (Baseline)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 80ms EVERY TIME&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; 300ms per write&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Instance Caching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Writer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;cached_metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cached_metadata&lt;/span&gt;&lt;span class="nf"&gt;.is_none&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cached_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cached_metadata&lt;/span&gt;&lt;span class="nf"&gt;.as_ref&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First write: 300ms&lt;/li&gt;
&lt;li&gt;Subsequent writes: 3ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; Every new writer instance pays 300ms cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Spark, we create &lt;strong&gt;many short-lived writer instances&lt;/strong&gt; (one per partition). Per-instance caching helped, but not enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global Caching (Our Approach)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;lazy_static!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="n"&gt;WRITER_CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WriterCache&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_writer_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 0.88ms from global cache&lt;/span&gt;
    &lt;span class="nf"&gt;write_vortex_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First write (any writer): 300ms (loads and caches)&lt;/li&gt;
&lt;li&gt;All subsequent writes (any writer): 3ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Win:&lt;/strong&gt; Cache survives writer instance lifecycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Speedup:&lt;/strong&gt; 97x improvement for steady-state writes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What We Gave Up
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Strong Schema Enforcement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Fast writes&lt;/li&gt;
&lt;li&gt;Con: May not catch schema errors immediately&lt;/li&gt;
&lt;li&gt;Mitigation: Optional per-write validation (configurable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Instant Schema Change Detection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Zero coordination overhead&lt;/li&gt;
&lt;li&gt;Con: Writers may use stale cached schema&lt;/li&gt;
&lt;li&gt;Mitigation: Cache invalidation on metadata file update (planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Fine-Grained Version Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Simple implementation&lt;/li&gt;
&lt;li&gt;Con: No explicit schema history tracking&lt;/li&gt;
&lt;li&gt;Mitigation: Version log (planned for v0.4.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What We Gained
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Write Latency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;97x faster metadata access&lt;/li&gt;
&lt;li&gt;6.7x end-to-end write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Operational Simplicity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external schema registry to operate&lt;/li&gt;
&lt;li&gt;Self-describing data files&lt;/li&gt;
&lt;li&gt;Automatic fallback chains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads work even if metadata is missing&lt;/li&gt;
&lt;li&gt;Automatic cache rebuilding&lt;/li&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Optimizations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Short-Term (v0.3.0)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cache Invalidation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Watch vine_meta.json for changes&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;watch_metadata_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;watcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;notify&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;watcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;watcher&lt;/span&gt;&lt;span class="nf"&gt;.watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;RecursiveMode&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;NonRecursive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Invalidate cache on file change&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="nf"&gt;.recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;invalidate_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache Warming:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Pre-load frequently used tables into cache&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;warm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;table_paths&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_reader_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Load into global cache&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Medium-Term (v0.4.0)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TTL-Based Cache Expiration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;CachedMetadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loaded_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Default: 1 hour&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;CachedMetadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;is_expired&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nn"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.duration_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.loaded_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.ttl&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache Statistics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;CacheStats&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AtomicU64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;misses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AtomicU64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AtomicU64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Expose metrics for monitoring&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_cache_hit_rate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CACHE_STATS&lt;/span&gt;&lt;span class="py"&gt;.hits&lt;/span&gt;&lt;span class="nf"&gt;.load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Ordering&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;misses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CACHE_STATS&lt;/span&gt;&lt;span class="py"&gt;.misses&lt;/span&gt;&lt;span class="nf"&gt;.load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Ordering&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Relaxed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;misses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By implementing a three-tier caching strategy with a global cache at the JNI boundary, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;97x faster&lt;/strong&gt; metadata lookups (80ms → 0.88ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6.7x faster&lt;/strong&gt; end-to-end writes (12s → 1.8s for 1M rows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero operational overhead&lt;/strong&gt; (no external systems required)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilient fallbacks&lt;/strong&gt; (schema-on-read always works)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache at the right level:&lt;/strong&gt; Global &amp;gt; Per-instance &amp;gt; No cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize for the common case:&lt;/strong&gt; Schemas rarely change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate read/write semantics:&lt;/strong&gt; Strict writes, lenient reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust file formats:&lt;/strong&gt; Self-describing data &amp;gt; external schemas&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For write-optimized data lakes, metadata access should be &lt;strong&gt;invisible&lt;/strong&gt;. Every millisecond spent on schema lookups is a millisecond stolen from actual data processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;Full implementation available at: &lt;a href="https://github.com/kination/vine" rel="noopener noreferrer"&gt;Vine GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key files:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/global_cache.rs&lt;/code&gt; - Global cache implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/reader_cache.rs&lt;/code&gt; - Schema-on-read fallback chain&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vine-core/src/metadata.rs&lt;/code&gt; - Metadata inference from Vortex files&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bigdata</category>
      <category>metadata</category>
      <category>rust</category>
      <category>buildmyownx</category>
    </item>
    <item>
      <title>Fundamental matters more in AI era</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sat, 14 Feb 2026 12:36:03 +0000</pubDate>
      <link>https://forem.com/kination/fundamental-matters-more-in-ai-era-4nkf</link>
      <guid>https://forem.com/kination/fundamental-matters-more-in-ai-era-4nkf</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) have become an inseparable part of the modern computing landscape. Software development is no exception. We’ve moved past the stage where AI only generates simple boilerplate; today, LLMs are capable of implementing complex logic and architecting entire applications.&lt;/p&gt;

&lt;p&gt;Naturally, this raises a pressing question for those of us in the industry: Is there a future for software developers, or are we being phased out?&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift is Real
&lt;/h2&gt;

&lt;p&gt;Some might argue that "no-code" movements have always tried to replace developers, only to fail because a professional is always eventually needed. However, the current shift feels different. This is the most significant paradigm shift I’ve witnessed since I started earning a living through code.&lt;/p&gt;

&lt;p&gt;Ignoring LLMs in favor of "pure" manual coding will soon lead to a dead end. But there is a flip side: if you simply rely on LLMs to write code while acting as a mere "code checker," you become easily replaceable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Paths Forward: Strategy in the Age of Autopilot
&lt;/h2&gt;

&lt;p&gt;To stay relevant, developers must choose distinct paths. The middle ground is rapidly disappearing.&lt;/p&gt;

&lt;p&gt;The Architect of Experience : Use AI as a force multiplier to solve human problems. Your value lies in how quickly you can integrate APIs and LLMs to build trendy, user centric services.&lt;/p&gt;

&lt;p&gt;The Architect of Systems (The Fundamentalist): Focus on the "under-the-hood" mechanics. As LLMs flood the world with abstraction, we need engineers who understand why a system fails under high concurrency or why a memory management strategy is suboptimal.&lt;/p&gt;

&lt;p&gt;I recently came across a blog post that perfectly encapsulates these sentiments. To be honest, this article has been motivated from here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html" rel="noopener noreferrer"&gt;https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Just Accept -&amp;gt; Validate
&lt;/h2&gt;

&lt;p&gt;The greatest trap of the LLM era is complacency. We must not settle for what the LLM hands us. Think of it this way: the time you've saved by not having to manually search through documentation should be reinvested into verifying and understanding the generated logic. When you combine your hard-earned experience with the efficiency of an LLM, your competitiveness doesn't just increase—it multiplies. You aren't just a consumer of AI; you are its auditor and architect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opening the Black Box: Back to the Metal
&lt;/h2&gt;

&lt;p&gt;The "Black Box" problem is a silent threat. In the data world, we stand on the shoulders of giants like Kubernetes, Spark, Flink, and Airflow. Yet, very few engineers understand these tools beyond their documentation.&lt;/p&gt;

&lt;p&gt;We must remember a fundamental truth: No matter how sophisticated the AI or how complex the technology, everything eventually runs on CPU, Memory, and Network. When a critical issue occurs in production, the root cause is almost always found within these three pillars. Engineers who can bridge the gap between high level AI abstractions and these fundamental hardware constraints will never be out of demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Build Your Own X" Philosophy
&lt;/h2&gt;

&lt;p&gt;This is exactly why I’ve been focusing on "build-your-own-x" projects. The goal is to bridge the gap between "using" a tool and "understanding" its core principles.&lt;/p&gt;

&lt;p&gt;Interestingly, you don't need to go back to school to do this. Your LLM is the ultimate mentor. A modern LLM can drastically reduce the time it takes to grasp complex internal architectures. It shouldn't just write your code; it should explain the "why" behind the logic, acting as a teacher that uncovers what you didn't even know you were missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay Curious
&lt;/h2&gt;

&lt;p&gt;While headlines talk about developer layoffs, there is still an immense demand for engineers who possess deep technical intuition and an insatiable curiosity.&lt;/p&gt;

&lt;p&gt;The AI era doesn't mean we study less; it means we must study deeper. Even if LLMs handle the bulk of the coding, services built for humans will always require people who understand the soul of the machine. Don't let the convenience of AI stifle your technical curiosity—use that convenience to fuel it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html" rel="noopener noreferrer"&gt;https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>software</category>
      <category>developer</category>
    </item>
    <item>
      <title>Story of 'smoodit' (1) : Electron to Tauri</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sat, 24 Jan 2026 08:39:15 +0000</pubDate>
      <link>https://forem.com/kination/story-of-smoodit-1-electron-to-tauri-3n7e</link>
      <guid>https://forem.com/kination/story-of-smoodit-1-electron-to-tauri-3n7e</guid>
      <description>&lt;p&gt;This article documents the story behind the development of text editor named as &lt;code&gt;smoodit&lt;/code&gt;, and the lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plan, and starting
&lt;/h2&gt;

&lt;p&gt;The primary objective which I've plan was to implement  private-based text editor desktop application with assistant, to boost editing efficiency by offering predictive text capabilities that anticipate user's needs.&lt;/p&gt;

&lt;p&gt;On making PoC, I started with Electron platform, which is close to the industry standard. I initially thought about calling external LLM(from ChatGPT, Gemini, Claude...) APIs, but after thinking about offline mode and cost management, I decided to go with an embedded model. After evaluating a few methods, I settled on "Ollama" because of how easily it could be integrated into the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think of migrating from Electron to Tauri
&lt;/h2&gt;

&lt;p&gt;Electron has been the go-to for desktop apps for years, but its massive resource footprint (thanks to Chromium) started weighing down my project. My application needed to run  &lt;code&gt;Python backend&lt;/code&gt; and &lt;code&gt;Local LLM engine (Ollama)&lt;/code&gt; simultaneously. The requirement to bundle an LLM significantly increased the application's footprint, depending on the model used. This made me much more conscious of the bundle size, which eventually became one of the primary catalysts for my decision to migrate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkol2gsdp1xt93pv1d3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkol2gsdp1xt93pv1d3y.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a project where performance and system agility are paramount, &lt;strong&gt;Tauri v2&lt;/strong&gt; emerged as the clear answer to next step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: React + Vite (leveraging Tauri v2 APIs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend (Sidecar)&lt;/strong&gt;: A FastAPI server packaged into a single binary using PyInstaller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Engine (Sidecar)&lt;/strong&gt;: A raw Ollama binary serving local LLMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Migration Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Mastering the Tauri Sidecar
&lt;/h3&gt;

&lt;p&gt;One of Tauri's powerful features is "Sidecar" — ability to bundle and execute external binaries alongside application core.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packaging Python: It used &lt;code&gt;pyinstaller&lt;/code&gt; to freeze Python app into a standalone executable and make it run independent.&lt;/li&gt;
&lt;li&gt;Configuration: I registered both &lt;code&gt;backend_server&lt;/code&gt; and the &lt;code&gt;ollama&lt;/code&gt; binaries in &lt;code&gt;src-tauri/tauri.conf.json&lt;/code&gt; under &lt;code&gt;externalBin&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In Tauri v2, sidecar binaries must include target triple suffix (ex&amp;gt; &lt;code&gt;-aarch64-apple-darwin&lt;/code&gt;) in their filenames to be correctly identified during runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Bridging the Frontend and Sidecars
&lt;/h3&gt;

&lt;p&gt;WebView security policies are very strict about local network requests. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of standard &lt;code&gt;fetch&lt;/code&gt;, utilized &lt;code&gt;@tauri-apps/plugin-http&lt;/code&gt; plugin. This allows React frontend to bypass CORS issues and speak directly to local FastAPI backend.&lt;/li&gt;
&lt;li&gt;User Experience: Added "Health Check Polling" mechanism. The UI remains in "Initializing" state until the backend sidecar reports status &lt;code&gt;200 OK&lt;/code&gt;, ensuring no requests are lost in the void during startup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Debugging stories
&lt;/h2&gt;

&lt;p&gt;The most interesting (and stressful) part of any migration is the troubleshooting. Here are several issue I've encountered and how I've fixed.&lt;/p&gt;

&lt;h3&gt;
  
  
  For MacOS' user — Quarantine
&lt;/h3&gt;

&lt;p&gt;When downloading or bundle third-party binaries, MacOS marks them with a "Quarantine" attribute. When Tauri tried to spawn Ollama, it would fail silently without any visible error.&lt;br&gt;
So, I've added a cleanup step in our &lt;code&gt;package.json&lt;/code&gt; build script using &lt;code&gt;xattr -d com.apple.quarantine&lt;/code&gt; to strip these attributes from all bundled binaries before execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  PIPE Buffer Hang
&lt;/h3&gt;

&lt;p&gt;Originally, I've used &lt;code&gt;subprocess.PIPE&lt;/code&gt; to capture Ollama's logs in Python. However, when the log data exceeded system's buffer size, the entire Ollama process would freeze (hang).&lt;br&gt;
For this, I've redirected sidecar output to a dedicated log file at &lt;code&gt;~/ollama_sidecar.log&lt;/code&gt;. This not only prevented the buffer-related hangs but also gave us a persistent way to inspect server logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zombie Processes and Lifecycle Management
&lt;/h3&gt;

&lt;p&gt;Also, I've struggled with Ollama instances status staying alive after application closed (zombie processes) because it was running on separate process.&lt;br&gt;
So I've used &lt;code&gt;start_new_session=True&lt;/code&gt; in Python subprocess spawn, to detach child from the parent session. Furthermore, I've implemented socket-based port check to verify if the port was &lt;em&gt;truly&lt;/em&gt; bound before declaring the server ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, was this worth it?
&lt;/h2&gt;

&lt;p&gt;The results speak for themselves. The installation package size plummeted compared to the Electron version, and memory usage is significantly lower. The combination of a React UI and the speed of Tauri v2, backed by the raw power of Python and Ollama, makes for a truly premium developer tool experience.&lt;/p&gt;

</description>
      <category>electron</category>
      <category>tauri</category>
      <category>smoodit</category>
      <category>llm</category>
    </item>
    <item>
      <title>'Chainguard' image for secure service</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sat, 17 Jan 2026 05:28:28 +0000</pubDate>
      <link>https://forem.com/kination/chainguard-image-for-secure-service-2jio</link>
      <guid>https://forem.com/kination/chainguard-image-for-secure-service-2jio</guid>
      <description>&lt;p&gt;If you work as &lt;code&gt;DevOps&lt;/code&gt; or &lt;code&gt;system backend&lt;/code&gt; development, one of critical point causing your stress will be 'security'. Even though it is always at the top of the priority list, its substance remains elusive.&lt;/p&gt;

&lt;p&gt;Assume you are maintaining k8s cluster, or docker image(in recent system, most of devops engineer will be related at least one of these). Even if your code is secure, the base image (Debian, Ubuntu, or even Alpine) often carries inherent technical debt, and within that debt, there are often security risks capable of compromising the service. Now, &lt;code&gt;Chainguard&lt;/code&gt; image starts at this point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Chainguard?
&lt;/h2&gt;

&lt;p&gt;In short, Chainguard Images are "secure-by-default" container images. They are built on top of Wolfi, a Linux "undistro" designed specifically for containers.&lt;/p&gt;

&lt;p&gt;Unlike standard Docker Hub images, Chainguard images have a few distinct characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distroless: They contain the bare minimum required to run the app. No shell, no package managers (in the runtime), and no bloat.&lt;/li&gt;
&lt;li&gt;Daily Rebuilds: Every image is rebuilt daily from upstream sources to patch vulnerabilities immediately.&lt;/li&gt;
&lt;li&gt;SBOMs &amp;amp; Signing: They come with Software Bill of Materials and Sigstore signatures out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Showdown: Official vs Chainguard
&lt;/h3&gt;

&lt;p&gt;The most immediate benefit is reduction in noise. Let's look at a comparison I ran using trivy on a standard Python image and Chainguard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official Image (python:3.11): Over 300 vulnerabilities, with some of them categorized as "Critical" or "High"&lt;/li&gt;
&lt;li&gt;Chainguard Image (cgr.dev/chainguard/python:latest): 0 CVEs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't magic; it's just aggressive minimalism. By removing the OS components that your application doesn't actually use, you remove the attack surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-on: Migration Guide
&lt;/h2&gt;

&lt;p&gt;Migrating isn't always a simple swap, and it's more painful with &lt;code&gt;Chainguard&lt;/code&gt;. This images are distroless, you cannot use apt-get or apk in the final runtime image. You must use multi-stage builds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool restriction
&lt;/h3&gt;

&lt;p&gt;A typical (and slightly vulnerable) Dockerfile might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Standard Python Image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.9-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Installing dependencies&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="c"&gt;# Running as root (default)&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is pretty simple dockerfile, to run python application.&lt;/p&gt;

&lt;p&gt;Here's the migrated version with chainguard image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1: Make 'builder'&lt;/span&gt;
&lt;span class="c"&gt;# Use '-dev' tag here, because we need a shell and build tools (gcc, headers, etc.)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;cgr.dev/chainguard/python:latest-dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Create a virtual environment&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv /app/venv
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/app/venv/bin:$PATH"&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# 2: Actual 'runner'&lt;/span&gt;
&lt;span class="c"&gt;# This image has no 'shell' and 'package manager'. It is pure runtime.&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; cgr.dev/chainguard/python:latest&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Copy the virtual environment from the builder&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/venv /app/venv&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="c"&gt;# Set path to use the virtualenv&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/app/venv/bin:$PATH"&lt;/span&gt;

&lt;span class="c"&gt;# chainguard runs as a non-root user ('nonroot') by default.&lt;/span&gt;
&lt;span class="c"&gt;# No need to create a user manually.&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gotchas and Troubleshooting&lt;br&gt;
"I can't exec into the pod!": Since the runtime image has no shell (/bin/sh or bash), you cannot run kubectl exec -it  -- sh. If you need to debug, you must use ephemeral debug containers or temporarily switch to the -dev image tag.&lt;/p&gt;

&lt;p&gt;Permissions: Chainguard enforces non-root execution. If your app tries to write to / or install packages at runtime, it will fail. Ensure your app only writes to designated directories (like /tmp or a mounted volume) where the nonroot user has permissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Package restriction
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Work without &lt;code&gt;sudo&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;"Why can't I run or install sudo?" &lt;/p&gt;

&lt;p&gt;...is perhaps the most common question when first switching to Chainguard(so do I).&lt;/p&gt;

&lt;p&gt;By security-first design, chainguard images do not include &lt;code&gt;sudo&lt;/code&gt; command (and mostly they do not allow to install through package manager). They are built to follow the principle of least privilege, running as a pre-configured non-root user (UID 65532) by default. This effectively eliminates entire classes of privilege escalation attacks.&lt;/p&gt;

&lt;p&gt;The Problem: Legacy setup scripts or runtime commands that rely on sudo will fail with a command not found error.&lt;/p&gt;

&lt;p&gt;The Solution: Frankly, there are no simple solution. &lt;/p&gt;

&lt;p&gt;You must shift your mindset to the &lt;code&gt;build&lt;/code&gt; stage. First thing you need to do is refactoring application design, to run without root access as possible. For example, Ii your app needs to bind to a privileged port (like 80), don't make it go on by using sudo. Instead, bind to a higher port (like 8080) and handle the mapping at the infrastructure level (e.g., Kubernetes Service or Load Balancer).&lt;/p&gt;

&lt;p&gt;But still there will be several process still needs it. Any operation requiring root privileges, such as installing system dependencies or modifying file permissions, must be performed in a multi-stage build using the USER root directive.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Latest-Only" Package Philosophy
&lt;/h3&gt;

&lt;p&gt;Chainguard's underlying package ecosystem, &lt;code&gt;Wolfi&lt;/code&gt;, operates on a Rolling Release model. This introduces a specific constraint regarding package versioning.&lt;/p&gt;

&lt;p&gt;Fixed Repo Constraints: Chainguard images are configured to pull exclusively from trusted Wolfi or Chainguard repositories. Adding third-party or unverified repos is strongly discouraged to maintain the "Zero CVE" guarantee.&lt;/p&gt;

&lt;p&gt;The Latest-Only Rule: In the public Wolfi repository, generally only the latest stable version of a package is maintained.&lt;/p&gt;

&lt;p&gt;If you try to pin an outdated version (e.g., apk add openssl=1.1.1), you will likely encounter an "Unsatisfiable constraints" error because that version has been removed from the index to prevent the use of vulnerable software.&lt;/p&gt;

&lt;p&gt;The Philosophy: Chainguard forces you to stay current. This "latest-only" approach ensures that you aren't unknowingly baking known vulnerabilities into your images. If your system absolutely requires a legacy, EOL (End-of-Life) version, you may need to consider their Enterprise tier, which provides a private repository for older, patched versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trap of "Legacy Version" Maintenance
&lt;/h2&gt;

&lt;p&gt;One of the biggest reasons teams stick to old Docker images (like node:14 or python:3.6) for production-level service, is stability. &lt;/p&gt;

&lt;p&gt;"&lt;strong&gt;It works&lt;/strong&gt;, so please don't touch it."&lt;/p&gt;

&lt;p&gt;However, in the standard Docker world, if you pin a version like python:3.6-slim, that base OS (likely an old Debian version) stops receiving security updates. You are sitting on a ticking time bomb of OS-level vulnerabilities. &lt;/p&gt;

&lt;p&gt;Frankly, everybody knows they needs to do something, but they cannot. Migrating production software is not just changing some code and package. It needs long-term testing, thinking of millions of use cases. Modern software are built on top of giant "system &amp;amp; source codes", and even top-level software engineer cannot expect all of exceptional cases. That's why even well-skilled engineers are concerning of migration every time.&lt;/p&gt;

&lt;p&gt;Chainguard changes this paradigm. Even if you use a specific language version, Chainguard rebuilds the underlying Wolfi OS layers every single day.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Good: You get the stability of your language version with the security of a bleeding-edge OS.&lt;/li&gt;
&lt;li&gt;The Bad: The image hash changes daily. If your deployment pipeline relies on the exact same SHA digest existing for months, this will break your flow. You need to get used to the idea of Rolling Tags.&lt;/li&gt;
&lt;li&gt;The Cost: It is worth noting that for End-of-Life (EOL) language versions (like Python 3.7 or Java 8), Chainguard usually moves those images to their paid tier. The free tier focuses on currently supported versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.chainguard.dev/" rel="noopener noreferrer"&gt;https://www.chainguard.dev/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://edu.chainguard.dev/" rel="noopener noreferrer"&gt;https://edu.chainguard.dev/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>docker</category>
      <category>security</category>
      <category>chainguard</category>
    </item>
    <item>
      <title>"Container" with OCI runtime</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sun, 17 Aug 2025 09:40:28 +0000</pubDate>
      <link>https://forem.com/kination/container-with-oci-runtime-1m3g</link>
      <guid>https://forem.com/kination/container-with-oci-runtime-1m3g</guid>
      <description>&lt;h2&gt;
  
  
  Rise of Docker project
&lt;/h2&gt;

&lt;p&gt;Introduced in 2010 by "dotCloud"(renamed to &lt;code&gt;Docker Inc.&lt;/code&gt;), and now docker became common standard of containerization. Also, it made containerizing tech to be very close to every engineers(not only infrastructure part), to setup common development environment easily for co-working engineers&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnpqr3tin6nvmszbk6s2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnpqr3tin6nvmszbk6s2.jpeg" alt="https://x.com/Docker" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is running based on container internally, which is a lightweight, standalone, and executable software package that encapsulates an application and all its dependencies, including code, runtime, system tools, libraries, and settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container?
&lt;/h2&gt;

&lt;p&gt;Probably you will be familiar with "Docker image", which is kind of blueprint to create instance. Containers are composed by one or more instances. Which means image became instance when running (such as &lt;code&gt;docker run&lt;/code&gt;), it becomes container, providing isolated/consistent environment to make application execute based on 'runtime'.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy28zazqugtxdy8ku4dwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy28zazqugtxdy8ku4dwr.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Runtime of container is low-level software component, responsible for executing and managing containers on a host operating system. It acts as the interface between the container and underlying system resources.&lt;/p&gt;

&lt;p&gt;Initially, Docker didn't have its own low-level container runtime, so it has relied on existing Linux Kernel feature for processing isolation named LXC (Linux Containers). However, this created a dependency on a specific kernel API and had some limitations in portability and functionality.&lt;/p&gt;

&lt;p&gt;To avoid these, Docker developed "libcontainer". This provides  native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container performing additional operations after the container is created.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is runC?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;runc&lt;/code&gt; is a CLI tool for spawning and running containers on Linux according to the OCI specification. In other words, it handles  low-level works to create/manage containers based on standard specification named OCI Runtime Specification(let's talk about this bit later). &lt;/p&gt;

&lt;p&gt;When you run a command like &lt;code&gt;docker run&lt;/code&gt;, docker daemon (in detail, it will be higher-level runtime like containerd) ultimately calls runC to perform the actual container creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  OCI Runtime?
&lt;/h2&gt;

&lt;p&gt;Open Container Initiative Runtime, or OCI Runtime, is specification that defines how to run a containerized application. This specification provides a standardized way to define the configuration and lifecycle of a container, ensuring that any OCI-compliant runtime can correctly execute a container created by another. &lt;/p&gt;

&lt;p&gt;The core of this specification is the filesystem bundle which contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rootfs: A directory containing the root filesystem of the container.&lt;/li&gt;
&lt;li&gt;config.json: A JSON file, that defines all the configuration for container process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;config.json&lt;/code&gt; is the configuration to generate container. It specifies essential details for OCI-Runtime to set up the isolated environment correctly. &lt;/p&gt;

&lt;p&gt;Here's some sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ociVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"terminal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"uid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"/bin/sh"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"root"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rootfs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"readonly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-container"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mounts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"linux"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"namespaces"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pid"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"network"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ipc"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uts"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mount"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cgroupsPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/my-container-cgroup"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find full specification here -&amp;gt; &lt;a href="https://specs.opencontainers.org/runtime-spec/config/" rel="noopener noreferrer"&gt;https://specs.opencontainers.org/runtime-spec/config/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Basic OCI-Runtime
&lt;/h2&gt;

&lt;p&gt;Implementing a basic OCI-compliant runtime involves utilizing core Linux kernel features to creat isolated environment based on the "config.json" file. &lt;/p&gt;

&lt;p&gt;This is key requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Parse config.json: The runtime will first read and parse the file, to understand the container's configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create &lt;code&gt;namespaces&lt;/code&gt;: It must call system calls like "unshare" or "clone" to create new Linux namespaces (PID, mount, UTS, network, etc.) as specified in the configuration file. These namespaces isolates container's processes from host system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up &lt;code&gt;cgroups&lt;/code&gt;: The runtime needs to interact with "cgroup" filesystem (/sys/fs/cgroup) to create and configure the control group for new container. This allows to apply resource limits for CPU, memory, I/O, etc..&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change root filesystem: Perform &lt;code&gt;pivot_root&lt;/code&gt; or &lt;code&gt;chroot&lt;/code&gt; system call to change root filesystem of the container process to &lt;code&gt;rootfs&lt;/code&gt; directory specified in config. This ensures the container can access its own filesystem only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Execute: Finally, the runtime triggers &lt;code&gt;execve&lt;/code&gt; system call to replace its own process to container's main command (e.g., /bin/sh) inside new, isolated environment. This new process becomes PID 1 inside the container's namespace.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[package]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my-oci-rt"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.0.0"&lt;/span&gt;
&lt;span class="py"&gt;edition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2021"&lt;/span&gt;

&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;nix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.30.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"sched"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;serde&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"derive"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;serde_json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;waitpid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;unistd&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;chroot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setgid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sethostname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setuid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Uid&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CString&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="nd"&gt;#[derive(Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;Debug)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;Debug)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;Debug)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Deserialize,&lt;/span&gt; &lt;span class="nd"&gt;Debug)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Root&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// size of stack (1MB)&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;raw_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"config.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to read config.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;raw_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to parse config.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Create child process&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child_closure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;isize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"== Inside child process {} =="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;unistd&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;getpid&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.hostname&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;sethostname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed setting hostname"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="nf"&gt;chroot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.root.path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"chroot failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;set_current_dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed changing directory to /"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Setup gid/uid &lt;/span&gt;
&lt;span class="nf"&gt;setgid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Gid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.process.user.gid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed setting GID"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;setuid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Uid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.process.user.uid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed setting UID"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// parse/trigger command&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.process.args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;CString&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.process.args&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nn"&gt;CString&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="nf"&gt;.as_str&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// swap to new process&lt;/span&gt;
        &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;unistd&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;execvp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;CString&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="nf"&gt;.as_str&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"execvp failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// generate namespace for uts, pid, namespace&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;clone_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CLONE_NEWUTS&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nn"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CLONE_NEWPID&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nn"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CLONE_NEWNS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child_pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_closure&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clone_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Signal&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SIGCHLD&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"clone failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"== In Parent Process =="&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Spawned child with PID: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child_pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// wait shutting down child process&lt;/span&gt;
    &lt;span class="nf"&gt;waitpid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"waitpid failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Child process exited."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Things to prepare:
&lt;/h4&gt;

&lt;p&gt;To run this project, prepare following features&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple &lt;code&gt;config.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ociVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"/bin/sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"-c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="s2"&gt;"echo 'Hello container!' &amp;amp;&amp;amp; ls -l"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"uid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"root"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rootfs"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-container"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Root filesystem for container, &lt;code&gt;rootfs&lt;/code&gt; directory:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's use simple environment with "busybox"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;rootfs
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;rootfs
&lt;span class="nv"&gt;$ &lt;/span&gt;wget https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x busybox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Result
&lt;/h4&gt;

&lt;p&gt;Now, you will see message from container as following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./target/debug/my-container
== In Parent Process ==
Spawned child with PID: 4500
== In Child Process (PID: 1) ==

Hello container!
total 1108
drwxr-xr-x    4 0        0              128 Aug 18 02:52 bin
-rwxr-xr-x    1 0        0          1131168 Jan 17  2022 busybox

Child process exited.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Note&amp;gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;nix&lt;/code&gt; library only allows you to use system calls that exist on the operating system you're compiling for. So, when compiling on macOS, the compiler will throw following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  |
1 | use nix::sched::{clone, CloneFlags};
  |                  ^^^^^  ^^^^^^^^^^ no `CloneFlags` in `sched`
  |                  |
  |                  no `clone` in `sched`
  |
  = help: consider importing this module instead:
          std::clone
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;nix::sched::clone&lt;/code&gt; module and function are Linux-specific, so it is not included in nix library for macOS. &lt;/p&gt;

&lt;p&gt;If you're trying to use this in Mac, build/run with separated linux environment with "Docker Desktop".&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/opencontainers/runtime-spec" rel="noopener noreferrer"&gt;https://github.com/opencontainers/runtime-spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/opencontainers/runc" rel="noopener noreferrer"&gt;https://github.com/opencontainers/runc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@fred.j/docker-in-depth-first-part-history-499682db0c12" rel="noopener noreferrer"&gt;https://medium.com/@fred.j/docker-in-depth-first-part-history-499682db0c12&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@avijitsarkar123/docker-and-oci-runtimes-a9c23a5646d6" rel="noopener noreferrer"&gt;https://medium.com/@avijitsarkar123/docker-and-oci-runtimes-a9c23a5646d6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iximiuz.com/en/posts/oci-containers/" rel="noopener noreferrer"&gt;https://iximiuz.com/en/posts/oci-containers/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>docker</category>
      <category>container</category>
      <category>rust</category>
    </item>
    <item>
      <title>'Spark Connect' in Apache Spark 4.0</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Thu, 17 Jul 2025 03:01:18 +0000</pubDate>
      <link>https://forem.com/kination/spark-connect-in-apache-spark-40-13e7</link>
      <guid>https://forem.com/kination/spark-connect-in-apache-spark-40-13e7</guid>
      <description>&lt;p&gt;Apache Spark 4.0 became official few month ago, with lots of  enhancements and new features. &lt;/p&gt;

&lt;p&gt;While the release includes various improvements, the evolution of &lt;code&gt;Spark Connect&lt;/code&gt; particularly stands out as a significant leap forward, making remote Spark sessions more seamless and powerful.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is &lt;code&gt;Spark Connect&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;If you're unfamiliar with this, you can think as "decoupled client-server architecture" for Apache Spark. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjj1cclfxn5bklwliz26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjj1cclfxn5bklwliz26.png" alt="spark-connect-architecture" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see in this image, it has client and server component. By using client API, it allows developers to interact with a Spark cluster from any application or environment, such as IDEs, notebooks, and applications written in various languages. You can think of similarity with &lt;code&gt;Apache Livy&lt;/code&gt;, but it has much more thin client to make application lightweight. Moreover, it is based on same API format and name with classic spark module, so it does not require additional API translation from client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6rjssbf2pvi0t8a7mki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6rjssbf2pvi0t8a7mki.png" alt="spark-connect-communicate" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It starts by translating DataFrame operations into unresolved logical plans encoded using protocol buffers in client side. These are sent to the server using the gRPC framework, and received protobuf messages will be reconstructed to logical plan. From there, common Spark execution process goes on, such as optimizing logical plan, transforming to physical plan, and execute.&lt;/p&gt;

&lt;p&gt;After execution, result will be return in "Apache Arrow" record format, streamed back with gRPC, so can be read sequentially in client side.&lt;/p&gt;

&lt;p&gt;This separation of client and server provides greater flexibility, easier dependency management.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in Spark 4.0 for this
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Enhanced API Coverage and Stability
&lt;/h3&gt;

&lt;p&gt;Previous &lt;code&gt;Spark Connect&lt;/code&gt; user may have encountered missing or partially implemented APIs. The development team has made a great effort to address these gaps in Spark 4.0. &lt;/p&gt;

&lt;p&gt;Through hard work from engineers done in &lt;a href="https://issues.apache.org/jira/browse/SPARK-47908" rel="noopener noreferrer"&gt;SPARK-47908&lt;/a&gt; and &lt;a href="https://issues.apache.org/jira/browse/SPARK-49248" rel="noopener noreferrer"&gt;SPARK-49248&lt;/a&gt;, &lt;code&gt;Spark Connect&lt;/code&gt; now achieved much higher API parity with common Spark implementation. This means that more of your existing Spark code will work seamlessly in 'connect' mode, without requiring extensive rewrites. &lt;/p&gt;

&lt;p&gt;This improved compatibility makes migrating existing applications to the connect architecture a much smoother and more predictable process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple trigger switch for 'connect' mode
&lt;/h3&gt;

&lt;p&gt;One of the most user-friendly changes in Spark 4.0 is the simplified process to enable &lt;code&gt;Spark Connect&lt;/code&gt;. Previously, setting up a remote connection required more specific configuration. Now activating it become simple, by adding single configuration property.&lt;/p&gt;

&lt;p&gt;By progress in &lt;a href="https://issues.apache.org/jira/browse/SPARK-50605" rel="noopener noreferrer"&gt;SPARK-50605&lt;/a&gt;, you can now switch to 'connect' mode just by setting "spark.api.mode"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.api.mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;spark-submit &lt;span class="nt"&gt;--master&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.api.mode&lt;span class="o"&gt;=&lt;/span&gt;connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default parameter for this is 'classic', and it will work in this way if you don't define anything, for ensuring backward compatibility. It is a small change, but significantly lowers the barrier to entry for developers looking to leverage the benefits of remote, interactive Spark environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Machine Learning modules
&lt;/h3&gt;

&lt;p&gt;Another significant leap will be Machine Learning (ML) capabilities. As outlined in &lt;a href="https://issues.apache.org/jira/browse/SPARK-50812" rel="noopener noreferrer"&gt;SPARK-50812&lt;/a&gt;, you can now execute not only SQL/DataFrame operations, but also ML workloads through &lt;code&gt;Spark Connect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This will be great improvements for data scientists who want to build, train, and manage models on a powerful cluster directly from their local development environment or notebook.&lt;/p&gt;

&lt;p&gt;And they are still working for additional improvements which targets 4.1 in &lt;a href="https://issues.apache.org/jira/browse/SPARK-51236" rel="noopener noreferrer"&gt;SPARK-51236&lt;/a&gt;, so you can expect for this too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;https://spark.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/spark" rel="noopener noreferrer"&gt;https://github.com/apache/spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>sparkconnect</category>
    </item>
    <item>
      <title>Retrieval-Augmented Generation (RAG) for AI starter</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Thu, 10 Jul 2025 12:30:11 +0000</pubDate>
      <link>https://forem.com/kination/retrieval-augmented-generation-rag-for-ai-starter-4mdf</link>
      <guid>https://forem.com/kination/retrieval-augmented-generation-rag-for-ai-starter-4mdf</guid>
      <description>&lt;h2&gt;
  
  
  AI everywhere, but it's not private
&lt;/h2&gt;

&lt;p&gt;With advancement of AI technologies, now it's possible to access a wide range of information through conversational interfaces, with just asking "what is xxx?". Because of strong usefulness, many companies are encouraging employees to use AI chatbots to boost productivity, and some of them even covers subscription costs.&lt;/p&gt;

&lt;p&gt;However, these AI models are trained with publicly available data, and naturally, they don't know about any proprietary or internal company information. For instance, if an employee asks, "What does it take to get a good performance review at work?", a general-purpose AI may respond with vague advice like "Deliver good results", rather than offering company-specific guidelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to RAG(Retrieval-Augmented Generation)
&lt;/h2&gt;

&lt;p&gt;To build an AI system that can answer such internal questions accurately, organizations typically need to integrate their own data into the AI's capabilities. &lt;/p&gt;

&lt;p&gt;One common approach is to train a generative AI model with internal documents through a method known as Fine-tuning. However, this is often resource-intensive, requiring significant time and cost, and it's inefficient when internal information changes frequently. For this reason, the Retrieval-Augmented Generation(a.k.a RAG) approach is generally preferred. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadybs4nhb9febk923445.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadybs4nhb9febk923445.png" alt="what-is-rag" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAG leaves the base generative model untouched, and retrieves relevant documents at query time. When a user submits a question, the system searches internal knowledge sources, retrieves the most relevant content, and provides it to the chatbot as context. This allows the AI to deliver accurate and up-to-date responses, while also enabling users to trace the source of the information.&lt;/p&gt;

&lt;p&gt;Following code is simple example of this logic, using &lt;code&gt;gemini&lt;/code&gt; API&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-api-key&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;gemini-model-name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gemini_rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Introduction:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Please describe by referring &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Introduction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;


&lt;span class="n"&gt;cont_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
My name is &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kination&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m software engineer. My skill is

Programming:
Skilled: Java, Python, TypeScript(JavaScript), Scala
Others: Rust

Experienced in:
Data Engineering
Cloud Infrastructure
Application development: Web, Android

Language:
Korean: Native
English: Business Level
Japanese: Conversational (JLPT - N2 certificated)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;gemini_rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could you describe about user kination?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cont_doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and here's the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;...
Based on the provided introduction, &lt;span class="s1"&gt;'kination'&lt;/span&gt; is a software engineer.

They possess strong programming skills, being skilled &lt;span class="k"&gt;in &lt;/span&gt;Java, Python, TypeScript &lt;span class="o"&gt;(&lt;/span&gt;JavaScript&lt;span class="o"&gt;)&lt;/span&gt;, and Scala, with additional familiarity with Rust. Their experience extends to Data Engineering, Cloud Infrastructure, and application development &lt;span class="k"&gt;for &lt;/span&gt;both web and Android platforms.

Regarding languages, &lt;span class="s1"&gt;'kination'&lt;/span&gt; is a native Korean speaker, proficient &lt;span class="k"&gt;in &lt;/span&gt;English at a business level, and conversational &lt;span class="k"&gt;in &lt;/span&gt;Japanese, holding a JLPT - N2 certificate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you input other username(such as 'Jimmy Rock'), it will response&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;There is no information about &lt;span class="s1"&gt;'Jimmy Rock'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;this document...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this sample, content has been offered with raw text as variable. So if you need to setup document with files such as pdf, word, it needs additional logic to setup contents as chunk data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UnstructuredWordDocumentLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;


&lt;span class="n"&gt;pdf_loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kination.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# If content is word
# word_loader = UnstructuredWordDocumentLoader("kination.docx")
&lt;/span&gt;
&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This typically involves splitting the content into smaller, manageable chunks, which helps improve the relevance of search results during retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmggu8dlto9ukclqvx7bf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmggu8dlto9ukclqvx7bf.webp" alt="vector" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the content is chunked, each chunk needs to be transformed into a &lt;code&gt;vector&lt;/code&gt;. It is some kind of numerical representation that captures the semantic meaning of text (usually with using embedding models). These vectors are then stored in vector DB, which allows efficient similarity search with user's query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.datasciencecentral.com/rag-and-its-evolution/" rel="noopener noreferrer"&gt;https://www.datasciencecentral.com/rag-and-its-evolution/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sendbird.com/blog/what-is-an-ai-agent/rag" rel="noopener noreferrer"&gt;https://sendbird.com/blog/what-is-an-ai-agent/rag&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pinecone.io/learn/vector-embeddings/" rel="noopener noreferrer"&gt;https://www.pinecone.io/learn/vector-embeddings/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/use-cases/retrieval-augmented-generation" rel="noopener noreferrer"&gt;https://cloud.google.com/use-cases/retrieval-augmented-generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Setting up memory for Flink 2 - What to think about...</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Mon, 23 Jun 2025 08:22:50 +0000</pubDate>
      <link>https://forem.com/kination/setting-up-memory-for-flink-2-what-to-think-about-m0i</link>
      <guid>https://forem.com/kination/setting-up-memory-for-flink-2-what-to-think-about-m0i</guid>
      <description>&lt;p&gt;Previously, I've written about memory configuration at Flink.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/kination/setting-up-memory-for-flink-configuration-4jm1"&gt;https://dev.to/kination/setting-up-memory-for-flink-configuration-4jm1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;...and this is the considerations you need for managing memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  General
&lt;/h2&gt;

&lt;p&gt;By understanding and configuring these memory types appropriately, you can optimize Flink's performance and ensure efficient resource utilization in your applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Balancing Memory Types
&lt;/h3&gt;

&lt;p&gt;It's important to balance the allocation of these different memory types based on the specific needs of your application. Over-allocating one type of memory can lead to under-utilization of others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring and Tuning
&lt;/h3&gt;

&lt;p&gt;Regularly monitor memory usage and performance metrics to fine-tune memory settings. Tools like Flink's Web UI and external monitoring solutions can provide insights into memory usage patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  JobManager
&lt;/h2&gt;

&lt;p&gt;Increasing the memory allocated to JobManager in Apache Flink can significantly enhance the stability and performance of your Flink cluster, particularly in large-scale deployments with numerous jobs. Here's how:&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Job Scheduling and Management
&lt;/h3&gt;

&lt;p&gt;With more memory, JobManager can handle a larger number of concurrent jobs. This is crucial in environments where multiple jobs are submitted and executed simultaneously.&lt;br&gt;
Efficient Scheduling Increased memory allows JobManager to maintain more detailed metadata about jobs, which can improve the efficiency of job scheduling and resource allocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Metadata Management
&lt;/h3&gt;

&lt;p&gt;The JobManager stores job graphs and execution plans in memory. With more memory, it can efficiently manage and store these data structures for a larger number of jobs.&lt;br&gt;
Also, JobManager coordinates checkpoints across all tasks. More memory allows it to handle complex checkpointing scenarios, especially in jobs with large state sizes or many parallel tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;In the event of a failure, JobManager needs to recover job states and restart jobs. More memory enables it to store comprehensive recovery information, reducing downtime and improving fault tolerance.&lt;br&gt;
For jobs with large states JobManager can manage state backends more effectively, ensuring that state snapshots and recovery processes are handled smoothly.&lt;/p&gt;

&lt;p&gt;Improved Performance in High-Throughput Scenarios&lt;br&gt;
JobManager processes various events, such as task status updates and resource allocation requests. More memory allows it to handle a higher volume of events without becoming a bottleneck. By having sufficient memory to manage internal queues and buffers JobManager can reduce latency in job execution and coordination.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations
&lt;/h2&gt;

&lt;p&gt;While increasing JobManager memory can provide these benefits, it's important to balance this with the overall resource availability in your cluster. Over-allocating memory to the JobManager might lead to resource constraints elsewhere.&lt;/p&gt;

&lt;p&gt;Regularly monitor JobManager's memory usage and performance metrics to ensure that the allocated memory is being utilized effectively. Adjust configurations based on observed performance and workload characteristics.&lt;/p&gt;

&lt;p&gt;By allocating sufficient memory to JobManager, you can enhance the robustness and efficiency of your Flink deployment, especially in scenarios involving complex, high-throughput, or large-scale job executions.&lt;/p&gt;

&lt;h2&gt;
  
  
  TaskManager Memory
&lt;/h2&gt;

&lt;p&gt;Allocating more memory to TaskManager in Apache Flink can significantly enhance its ability to handle larger state backends, improve data processing efficiency, and reduce I/O operations. Here's how:&lt;/p&gt;

&lt;h3&gt;
  
  
  Larger State Backends
&lt;/h3&gt;

&lt;p&gt;Flink's stateful stream processing allows tasks to maintain state across events. This state is crucial for operations like aggregations, joins, and windowing.&lt;/p&gt;

&lt;p&gt;With more memory, a larger portion of the state can be kept in memory, which is faster to access than disk-based storage. This is particularly beneficial for applications with large state requirements.&lt;/p&gt;

&lt;p&gt;Also, when memory is limited, Flink may need to spill state to disk, which can slow down processing. More memory reduces the need for spilling, allowing for faster state access and updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Efficient Data Processing
&lt;/h3&gt;

&lt;p&gt;Additional memory allows for larger buffers and caches, which can improve the throughput of data processing tasks. This is especially important for operations that involve sorting, grouping, or joining large datasets.&lt;/p&gt;

&lt;p&gt;More memory can support higher levels of parallelism by allowing more task slots per TaskManager. This enables the processing of more data in parallel, improving overall job throughput.&lt;/p&gt;

&lt;p&gt;Sufficient memory helps in managing data flow more effectively, reducing the likelihood of back-pressure, which can occur when downstream tasks cannot keep up with upstream data production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced I/O Operations
&lt;/h3&gt;

&lt;p&gt;By keeping more data in memory, TaskManagers can minimize the need for disk I/O operations, which are typically slower than memory operations. This is crucial for maintaining high throughput and low latency.&lt;/p&gt;

&lt;p&gt;Also more memory allows for efficient checkpointing by reducing the frequency and size of data written to persistent storage. This can speed up the checkpointing process and reduce the impact on processing performance.&lt;/p&gt;

&lt;p&gt;And more, larger memory allocations can improve network I/O by allowing more data to be buffered and sent in larger batches, reducing the overhead associated with frequent small network transmissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Considerations
&lt;/h3&gt;

&lt;p&gt;While increasing memory can provide these benefits, it's important to balance memory allocation with other resources like CPU and network bandwidth to avoid bottlenecks.&lt;/p&gt;

&lt;p&gt;Regularly monitor memory usage and performance metrics to ensure that the additional memory is being utilized effectively. Adjust configurations based on workload characteristics and observed performance.&lt;/p&gt;

&lt;p&gt;By allocating more memory to TaskManagers, you can enhance the performance and efficiency of Flink applications, particularly those with large state requirements or high data throughput. This leads to faster data processing, reduced latency, and improved scalability.&lt;/p&gt;

</description>
      <category>flink</category>
      <category>memory</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to treat secure data on lakehouse</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Tue, 08 Apr 2025 11:25:46 +0000</pubDate>
      <link>https://forem.com/kination/how-to-treat-secure-data-on-lakehouse-20fa</link>
      <guid>https://forem.com/kination/how-to-treat-secure-data-on-lakehouse-20fa</guid>
      <description>&lt;p&gt;In the modern data stack, the lakehouse has emerged as a hybrid solution that combines the scalability of a data lake with the transactional power of a data warehouse. But with this convergence comes a heightened need to manage secure data responsibly - whether it’s personally identifiable information (PII), health records, or financial transactions.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll walk through how to treat secure data in a &lt;code&gt;Lakehouse&lt;/code&gt; architecture, from ingestion and storage to governance and auditing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybp6amxq96mmc4ymsfte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybp6amxq96mmc4ymsfte.png" alt="data-lake" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Lakehouse?
&lt;/h2&gt;

&lt;p&gt;Lakehouse is a modern data architecture that combines the best of both &lt;code&gt;data lake&lt;/code&gt; and &lt;code&gt;data warehouse&lt;/code&gt;. It aims to deliver the scalability and flexibility of a data lake with the reliability, performance, and governance of a warehouse. By unifying raw and structured data in one platform, lakehouses support analytics, BI, and machine learning without the need for complex ETL pipelines. &lt;/p&gt;

&lt;p&gt;The goal is to get the benefits of both systems while simplifying data workflows. However, while powerful, it also facing challenges on query performance, governance maturity, and integration with legacy tools comparing with each of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep it secure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Convert existing data
&lt;/h3&gt;

&lt;p&gt;In many data systems, certain columns contain sensitive information(user name, email, ID...)that must be handled differently to comply with privacy regulations and national security policies. To protect this data, various techniques can be applied depending on the use case.&lt;/p&gt;

&lt;p&gt;If only specific teams (ex&amp;gt; research department) need access to the original values, the data can be encrypted using keys that are only shared with those teams. When the original value is not needed but uniqueness is required—for example, for joining datasets—hashing is a suitable approach. In cases where partial visibility is acceptable (ex&amp;gt; showing only the last four digits of a phone number), string masking can be used to obscure parts of the value while retaining some context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.fernet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Fernet&lt;/span&gt;  &lt;span class="c1"&gt;# assume using fernet for encryption
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="c1"&gt;# ---------- Setup ----------
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecureSample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ---------- Sample Schema ----------
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;SECRET_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;fernet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Fernet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SECRET_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encrypt_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;encrypted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fernet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;encrypted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;encrypt_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encrypt_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# ---------- Read from Kafka ----------
&lt;/span&gt;&lt;span class="n"&gt;kafka_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker-server-host:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka-topic-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;json_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAST(value AS STRING)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# encrypt email column
&lt;/span&gt;&lt;span class="n"&gt;encrypted_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_df&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_encrypted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;encrypt_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ... do something with encrypted data frame ...
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Access limitation
&lt;/h3&gt;

&lt;p&gt;Fine-grained access control ensures that only authorized users can access specific data fields, rows, or columns—crucial for protecting sensitive information in a lakehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jwgrmq57ue85vo1ja63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jwgrmq57ue85vo1ja63.png" alt="lake-formation" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS offers this through &lt;code&gt;Lake Formation&lt;/code&gt;, which allows you to define column/row level permissions on tables stored in S3. You can grant access to specific users or groups using IAM roles, and manage access using the central Lake Formation console. Integration with AWS Glue Data Catalog ensures governance across multiple tools.&lt;/p&gt;

&lt;p&gt;GCP also provides column-level security in BigQuery, where you can restrict access to specific columns using Data Catalog tags and IAM policies. For row-level security, authorized views or row access policies can be applied to show only the data a user is allowed to see.&lt;/p&gt;

&lt;p&gt;Both platforms support auditing, logging, and attribute-based access control (ABAC) to track and enforce data governance at scale. These features are key to meeting compliance requirements like GDPR or HIPAA while keeping analytics flexible and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lakefs.io/" rel="noopener noreferrer"&gt;https://lakefs.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/lake-formation/" rel="noopener noreferrer"&gt;https://aws.amazon.com/lake-formation/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/bigquery/docs/column-level-security-intro" rel="noopener noreferrer"&gt;https://cloud.google.com/bigquery/docs/column-level-security-intro&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>spark</category>
    </item>
    <item>
      <title>Time of YAML/JSON for data engineer</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Sun, 06 Apr 2025 09:49:16 +0000</pubDate>
      <link>https://forem.com/kination/time-of-yamljson-for-data-engineer-84c</link>
      <guid>https://forem.com/kination/time-of-yamljson-for-data-engineer-84c</guid>
      <description>&lt;p&gt;If you're tracking trends of data engineering, you could find out that there's a big shift in tools we use. &lt;/p&gt;

&lt;p&gt;Recently, Java(+Scala) and Python have been a go-to languages for data engineers, and it's still true. However, there's a growing trend towards using YAML/JSON configuring to control data pipeline. Will it be a big change to us?&lt;/p&gt;

&lt;h2&gt;
  
  
  "Configuration" on top
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7fjntjxzftfp2mb55n7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7fjntjxzftfp2mb55n7.jpg" alt="config-to-pipeline" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative Pipelines
&lt;/h3&gt;

&lt;p&gt;One of the primary reasons data engineers are leaning towards YAML/JSON, is major platform(or cloud) service provider are offering declarative pipelines. These platforms allow(or willing) engineers to define data workflows only by using config files, instead of making from code.&lt;/p&gt;

&lt;p&gt;This approach simplifies the process of setting up and managing complex data pipelines, making it more accessible and causing less exceptional cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud services and simplified pipelines development
&lt;/h3&gt;

&lt;p&gt;As adoption of cloud services such as AWS, Azure, and GCP increases, companies are building their data pipelines upon these platforms. These providers offers various functions and services that simplifies data pipeline creation and management. &lt;/p&gt;

&lt;p&gt;While these services come at an additional cost, they significantly reduce the complexity points involved in pipeline development. It allows engineers to focus on designing main tasks in high level, rather than struggling in low-level coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  From comfort, but now become a trend
&lt;/h3&gt;

&lt;p&gt;What started as a move towards greater comfort and ease of use has now become a trend due to the flexibility these tools offer. YAML/JSON allow for quick adjustments and iterations, making them ideal for the fast-paced environment of data engineering. This flexibility is particularly valuable in a field where requirements can change rapidly, and the ability to adapt is crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this shift beneficial for Engineers?
&lt;/h2&gt;

&lt;p&gt;The shift towards using config files brings both opportunities and challenges for data engineers. It is true that these tools can increase productivity by simplifying complex tasks and reducing the need for extensive coding. But on the other hand, they may limit the depth of customization and optimization that can be achieved with traditional programming languages like Java and Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The trend towards using configuration files in data engineering reflects a broader shift towards simplicity and flexibility in the field. While this shift offers many benefits, it's essential for data engineers to maintain a balance, ensuring they have the skills and knowledge to leverage both traditional programming languages and modern configuration tools effectively. It means it can give simplicity on development, but it also requires engineers to have more deep knowledge of how this configuration affects to the pipeline under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataengineeringcentral.substack.com/p/config-driven-pipelines" rel="noopener noreferrer"&gt;https://dataengineeringcentral.substack.com/p/config-driven-pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://juhache.substack.com/p/from-data-engineer-to-yaml-engineer-ed2" rel="noopener noreferrer"&gt;https://juhache.substack.com/p/from-data-engineer-to-yaml-engineer-ed2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>programming</category>
      <category>config</category>
    </item>
    <item>
      <title>build-my-own-datalake: Starting from PoC</title>
      <dc:creator>kination</dc:creator>
      <pubDate>Thu, 20 Mar 2025 07:02:25 +0000</pubDate>
      <link>https://forem.com/kination/build-my-own-datalake-part-1-367h</link>
      <guid>https://forem.com/kination/build-my-own-datalake-part-1-367h</guid>
      <description>&lt;h2&gt;
  
  
  build-your-own-x
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x" rel="noopener noreferrer"&gt;build-your-own-x&lt;/a&gt; is the project which has been started from GitHub community. As you can expect through name, it aims to share code that builds well-known projects or applications(database, server framework, os, and more) from the bottom.&lt;/p&gt;

&lt;p&gt;As you know there are a common programming tip "Don't re-invent the wheel" which says that building something new from scratch is often a foolish endeavor. However, I believe this advice is only partially correct.&lt;/p&gt;

&lt;p&gt;In terms of time efficiency, developing something from the ground up can indeed be time-consuming. But it's worth noting that long-established systems often carry significant technical debt and can be challenging to modify, especially when they're widely used in production environments.&lt;/p&gt;

&lt;p&gt;So, if you have a clear purpose and compelling reasons for creating a new system from scratch, it's not necessarily a waste of time. In fact, it can be a valuable and justified approach in certain situations."&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;Of course, thing I'm starting here, doesn't have such a grand purpose. Currently I've worked as an data engineer for several years, and faced on file format which is specified to big data like parquet/orc.&lt;/p&gt;

&lt;h3&gt;
  
  
  datalake format
&lt;/h3&gt;

&lt;p&gt;Recently one of trend in data engineering is datalake format, which is making a table based on group of file and metadata in particular architecture. It gives flexibility on keeping data with ACID transactions, versioning, and schema enforcement.&lt;/p&gt;

&lt;p&gt;As you may know, there are already widely used datalake formats(iceberg, hudi, delta, ...). These formats are being rapidly developed with strong community support, and I'm also using them as part of my job.&lt;/p&gt;

&lt;p&gt;However, aside from using them effectively, I have always been curious about how they work internally. While they are open-source and can review through them, I believe that building one from the bottom is also effective way to understand them. &lt;/p&gt;

&lt;p&gt;That’s why I decided to start this project. Additionally, since most existing formats are based on Java/Scala, I wanted to explore whether using Rust could be an effective alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to start
&lt;/h2&gt;

&lt;p&gt;So for the project, first part to make is&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metadata&lt;/li&gt;
&lt;li&gt;core file reading/writing part&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, it should offer interface to communicate with data processing frameworks such as Spark/Flink, so should make JNI also.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;Here's current full result, which named as &lt;code&gt;vine&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kination/vine" rel="noopener noreferrer"&gt;https://github.com/kination/vine&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll describe about core part only in &lt;code&gt;Part 1&lt;/code&gt; post, and do others at next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[no_mangle]&lt;/span&gt;
&lt;span class="nd"&gt;#[allow(non_snake_case)]&lt;/span&gt;
&lt;span class="nd"&gt;#[allow(unused_variables)]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;Java_io_kination_vine_VineModule_readDataFromVine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JNIEnv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JClass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JString&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;jobject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="nf"&gt;.get_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot get data from dir_path"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;CString&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot generate CString from result"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="nf"&gt;.new_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="nf"&gt;.to_str&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot create java string"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.into_raw&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[no_mangle]&lt;/span&gt;
&lt;span class="nd"&gt;#[allow(non_snake_case)]&lt;/span&gt;
&lt;span class="nd"&gt;#[allow(unused_variables)]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;Java_io_kination_vine_VineModule_writeDataToVine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JNIEnv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JClass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JString&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;path_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="nf"&gt;.get_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail getting path"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="nf"&gt;.get_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fail getting path"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_str&lt;/span&gt;&lt;span class="nf"&gt;.lines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;write_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;path_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to write data"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Core module acts as the vital bridge between JVM and Rust engine. Using the Java Native Interface (JNI), there is &lt;code&gt;readDataFromVine&lt;/code&gt; and &lt;code&gt;writeDataToVine&lt;/code&gt; to work for data exchanging.&lt;/p&gt;

&lt;p&gt;Currently, data is passed as raw strings across the boundary. While this approach allowed for rapid prototyping, it causes overhead involved in string allocation and C-string conversion. There will be better option, but this is to make quick approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata
&lt;/h3&gt;

&lt;p&gt;The metadata file is JSON-based and contains information about fields and the table name. Currently, it's stored as a single file, but additional files will be needed to support versioning and schema evolution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vine-test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"is_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writer
&lt;/h3&gt;

&lt;p&gt;Writing, goes as&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read raw data which is list of string&lt;/li&gt;
&lt;li&gt;Read metadata, and check field type&lt;/li&gt;
&lt;li&gt;Match field type and data&lt;/li&gt;
&lt;li&gt;Write to file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On parsing metadata, it only supports 4 kinds of type, but should do it for more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;meta_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to deserialize metadata"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="py"&gt;.fields&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;schema_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"message schema {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;meta_fields&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;field_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="py"&gt;.data_type&lt;/span&gt;&lt;span class="nf"&gt;.as_str&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"integer"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"REQUIRED INT32"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"string"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"REQUIRED BINARY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"boolean"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"REQUIRED BOOLEAN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"double"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"REQUIRED DOUBLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;field_type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"REQUIRED BINARY"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;schema_str&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"    {} {} (UTF8);&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;schema_str&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"    {} {};&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After defining types, it writes the data in &lt;code&gt;parquet&lt;/code&gt; format (of course, it can be changed)&lt;/p&gt;

&lt;p&gt;One known issue is that the current logic follows these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the type of the raw data.&lt;/li&gt;
&lt;li&gt;Store the raw data in a temporary variable of that type.&lt;/li&gt;
&lt;li&gt;Write the data through the temporary variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, I believe there's a more efficient approach to streamline this process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reader
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vine-test/vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to read vine_meta.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;meta_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to parse metadata JSON"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;directories&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sub_entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot read path"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;se&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sub_entries&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot get sub entry"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.path&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="nf"&gt;.extension&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"parquet"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;    
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nf"&gt;.as_array&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fields should be an array"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;File&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot open file from file_path"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SerializedFileReader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cannot serialize file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.get_row_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cannot get row iterator"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_result&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row_result&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

                        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="c1"&gt;// Fields are 1-indexed in metadata, but 0-indexed in parquet&lt;/span&gt;
                            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;col_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nf"&gt;.as_i64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"data_type"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nf"&gt;.as_str&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data_type should be string"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="s"&gt;"integer"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="nf"&gt;.get_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                &lt;span class="s"&gt;"string"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;                                    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="nf"&gt;.get_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                                &lt;span class="p"&gt;},&lt;/span&gt;
                                &lt;span class="o"&gt;...&lt;/span&gt;
                                &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="p"&gt;};&lt;/span&gt;
                            &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;

                        &lt;span class="n"&gt;row_list&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This module is responsible for reconstructing structured data from fragmented files. It doesn't read a single file; it traverses the directory structure associated with a table.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It scans for all .parquet files within the table's path, allowing for distributed data storage.&lt;/li&gt;
&lt;li&gt;By aligning the 0-indexed columns of Parquet with our 1-indexed metadata IDs, the reader ensures that data is extracted into the correct fields.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;SerializedFileReader&lt;/code&gt;, the reader iterates through rows and joins values into a comma-separated format for the consuming application.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  After PoC
&lt;/h2&gt;

&lt;p&gt;Building datalake format from the bottom has been an exercise of understanding delicate relationship between high-level data abstractions and low-level storage efficiency. In this part, I’ve established an early-stage—data format that can be used to both Java and Rust, manage its own schema, and handle basic I/O operations using Parquet.&lt;/p&gt;

&lt;p&gt;But this is only starting point. My goal is not to replace this with current datalake format (like iceberg or hudi), but to improve this until it reaches in production level.&lt;/p&gt;

&lt;p&gt;I'll keep share advanced steps for this.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>datalake</category>
      <category>rust</category>
      <category>buildmyownx</category>
    </item>
  </channel>
</rss>
