Forem: kination

Skill to make you slow, but to go fast

kination — Sun, 22 Mar 2026 01:24:59 +0000

Here's a skill set I want to share.

kination / slow-slow-quick-quick

Skills for slo-mo, to go fast later.

slow-slow-quick-quick

A collection of AI assistant skills built around intentional friction. It slows down interactions to deepen user's understanding, form muscle memory, and produce work that reflects the user's genuine thinking rather than AI-generated output.

The name reflects the deliberate practice philosophy: go slow now, to go fast later.

Skills

Skill	Trigger	What it does
`slow-vibe-coding`	Implementing a feature, writing a function, solving a coding problem	Refuses to write implementation code. Brainstorms patterns, presents at least two concrete approaches with trade-offs, provides comment-only skeletons, and guides you to write it yourself.
`slow-vibe-sw-architect`	System design, architecture decisions, technical choices	Refuses to give an immediate answer. Asks trade-off questions and failure scenarios until you arrive at a decision you own. Produces an ADR.
`slow-dev-research`	Technology selection, stack comparisons, engineering trade-off questions	Refuses to give a recommendation upfront. Surfaces research and real-world cases with sources, asks requirements questions one at a time, and

…


LLMs have changed how fast software gets made. Tasks that used to take days now finish in minutes. This speed creates a specific problem for the engineers in charge: it is remarkably easy to become a spectator in your own project.

I have felt this disconnection during recent work. When AI implements a feature, the codebase grows so fast that the human brain can barely keep up. You are directing the work, but implementation details can bypass your understanding. You end up with a gap between what the system does and what you actually know about its inner workings.

Don't worry. It's not your fault, and it does not mean you are an incompetent engineer. Strictly in terms of the "speed" of software creation and problem solving, no human can come close to current AI. That is how far AI has come today.

Make slowly, to go fast later.

True productivity follows a rhythm of "Slow, Slow, Quick, Quick."

This means alternating time to think with fast implementation. Slowing down to analyze a design choice is not a waste of time. It is a necessary investment. Building a clear mental model during the "slow" phases ensures the "quick" phases stay grounded. Without this balance, you just make shallow decisions that lead to huge technical debt.

This is the motivation for this skill.

Here's how to get started, for Claude Code users:

$ /plugin marketplace add kination/slow-slow-quick-quick
$ /plugin install slow-engineering@slow-slow-quick-quick

(Versions for other LLMs are coming soon...)

Strategic Implementation with `slow-vibe-coding`

The slow-vibe-coding skill is one way to stay involved in the coding. Typical assistants dump blocks of code, but this skill follows a strict protocol. The AI is prohibited from writing implementation logic. It guides you through the problem and the patterns, then it stops.

The AI generates a skeleton of comments and a "pass" statement. Even though it guides you with well-written comments, or with base code built from your description, you have the chance to be involved in every output.

This builds muscle memory and maintains ownership. By the end, you have to be able to explain the details in the review process. The AI becomes a high-level director, not a code generator.

Architectural Integrity with `slow-vibe-sw-architect`

Design is usually the first thing lost when you move too fast. The slow-vibe-sw-architect skill enforces a pre-implementation step. It prohibits recommendations until you evaluate three tradeoffs and two failure scenarios. This forced delay explores scale, consistency, and complexity.

The result is a co-authored "Architecture Decision Record" (ADR). Documenting the "why" keeps the system coherent. It prevents AI from building something no human can debug.

So, that's the summary of these two.

And one more, `slow-dev-research`

Technology decisions often get made on familiarity, not fit. Selecting a message queue or a database under time pressure usually ends with choosing what the team already knows. The research phase gets skipped because it feels like overhead. This is where a lot of technical debt originates.

The slow-dev-research skill adds friction before choices get made. The AI collects benchmarks, real-world cases, and known trade-offs with sources. After that, it asks about your specific constraints one at a time: expected traffic, team experience, consistency requirements, cost model. It refuses to give a recommendation until you have surfaced your own requirements.

The output is a research document with a trade-off map you fill out and a conclusion section you write yourself. It feeds directly into slow-vibe-sw-architect when a design decision follows.

Conclusion

These are the first of the "slow" skills. I am planning more, to cover the full lifecycle 'slowly', from research to verification.

Using LLMs is inevitable. I love coding, and still believe in the competitive edge of human-made code. But even so, AI is simply too fast and too productive to ignore.

But long-term productivity requires more than raw generation. Working slowly and deeply at critical points improves your understanding, so you can move quickly later. Catching up with AI's speed may be impossible, but falling behind is not an option if you want to remain an engineer. Our mastery of the architecture must grow alongside the code the AI produces.

build-my-own-datalake: Improve metadata with caching

kination — Sat, 28 Feb 2026 06:37:35 +0000

This is the next part of the following post: https://dev.to/kination/build-my-own-datalake-part-1-367h

Building a High-Performance Metadata System with Global Caching

Caching schema metadata at the JNI boundary to eliminate per-write filesystem reads

My goal for the Vine project is a write-optimized data lake format. What I didn't expect was that reading a small JSON file would be the bottleneck.

My initial implementation was spending a significant chunk of each write just loading a schema definition from disk—repeatedly, on every operation.

For a system handling thousands of writes per second, this compounds fast.

The Initial Implementation

Like many data lake formats, Vine uses a metadata file to define table schemas:

{
  "table_name": "user_events",
  "fields": [
    { "id": 1, "name": "user_id",     "data_type": "long",   "is_required": true },
    { "id": 2, "name": "event_type",  "data_type": "string", "is_required": true }
  ]
}

Both read and write paths load this file on every call:

fn load_metadata(base_path: &str) -> Result<Metadata> {
    let meta_path = Path::new(base_path).join("vine_meta.json");
    let content = fs::read_to_string(meta_path)?;  // filesystem hit every time
    serde_json::from_str(&content).map_err(Into::into)
}

fn write_user_events(base_path: &str, events: Vec<UserEvent>) -> Result<()> {
    let metadata = load_metadata(base_path)?;  // called on EVERY write

    let today = chrono::Utc::now().format("%Y-%m-%d").to_string();
    let date_dir = Path::new(base_path).join(&today);
    fs::create_dir_all(&date_dir)?;

    let output_file = date_dir.join(format!(
        "data_{}.parquet",
        chrono::Utc::now().format("%H%M%S_%f")
    ));
    let parquet_schema = metadata_to_parquet_schema(&metadata);
    let mut writer = parquet::file::writer::SerializedFileWriter::new(
        fs::File::create(output_file)?,
        parquet_schema.clone(),
        Default::default(),
    )?;
    for event in events {
        writer.write(event_to_parquet_row(&event, &metadata)?)?;
    }
    writer.close()?;
    Ok(())
}

My first, naive approach was to read vine_meta.json on every write operation. It was doing disk reads for data that never changed.

It's just a small JSON file. Why does it matter?

That's a natural first reaction. This metadata JSON file is maybe 1-2KB. The OS page cache should keep it hot. Why would this make it slow?

It turns out the file size is almost irrelevant. What's expensive is everything around the read.

Syscall overhead

Every fs::read_to_string() call goes through at least open(), read(), and close(): three syscalls minimum, each of which requires a switch from user mode into the kernel and back. System calls have fixed overhead regardless of how much data they transfer. A study measuring Linux syscall overhead puts the baseline at roughly 1-4 microseconds per call on modern hardware, before any I/O happens.

At 1000 writes/second, that's continuous mode-switching pressure. A small file just means you're paying that fixed overhead without getting much data for it.
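
To put rough numbers on that fixed cost, here is a back-of-the-envelope sketch. It only multiplies the figures quoted above (three syscalls per read, 1-4 microseconds each, 1000 writes per second); the function name is invented for illustration, and real costs will vary by kernel and hardware.

```rust
/// Lower bound on time spent in syscall entry/exit, in microseconds,
/// for every second of wall-clock time.
/// All three inputs are assumptions taken from the text, not measurements.
pub fn syscall_floor_us_per_sec(
    writes_per_sec: u64,
    syscalls_per_write: u64,
    us_per_syscall: u64,
) -> u64 {
    writes_per_sec * syscalls_per_write * us_per_syscall
}
```

With 1000 writes per second and three syscalls per write, this gives 3,000 to 12,000 microseconds (3-12 ms) of pure syscall overhead per second. That is only the floor: path resolution and JSON deserialization come on top of it.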

Page cache doesn't eliminate overhead

Even if the OS caches the file contents, you still pay for:

VFS path resolution (walking the directory tree)
dentry and inode lookups in the kernel
Lock acquisition on the file
JSON deserialization (on every call, even on a page-cache hit)

Red Hat documentation on I/O performance factors notes that small files generate disproportionate overhead relative to data transferred, because fixed costs of the filesystem stack dominate. Research from CMU's Parallel Data Lab on scaling file system metadata performance shows that dcache and inode lookup latency is often the limiting factor for metadata-heavy workloads, not disk bandwidth.

Industry validation: this is a known problem

This isn't unique to Vine. Major data systems have all hit the same wall:

Apache Iceberg was explicitly designed to avoid Hive Metastore's pattern of fetching partition locations (a metadata call) followed by file system lookups (a storage call) on every query. The Iceberg vs Delta Lake metadata comparison explains how Iceberg's manifest files cache partition-to-file mappings to avoid this double-lookup pattern.
Google BigQuery added metadata caching for external tables specifically because "listing millions of files from external data sources can take several minutes" without it. Their documentation notes that caching metadata avoids repeated round-trips to external storage on every query.
Microsoft Azure Files added metadata caching for SMB workloads and observed up to 55% latency reduction and 2-3x more consistent response times for workloads with frequent metadata access.

The same pattern shows up everywhere: metadata is small, but accessing it repeatedly at high frequency creates real overhead. The solution is always the same: cache in memory and skip the filesystem entirely on the hot path.

Solution: three-tier caching

I ended up with three caching layers. Here's the description.

Layer 1: Global In-Memory Cache — lazy_static + Mutex<HashMap>, shared across all JNI calls for the lifetime of the process. Handles the vast majority of operations. Schemas rarely change, so caching them costs almost nothing.
Layer 2: Local Disk Cache — _meta/schema.json per table, updated asynchronously. Covers cold-start recovery. When the process restarts, I skip re-reading the original metadata file and go straight to disk cache.
Layer 3: Vortex File Inference — Read schema directly from data file headers, merging if multiple versions exist. The fallback that always works. Even without vine_meta.json, I can infer schema from Vortex data files.

Implementation: global cache layer

At the JNI boundary between Spark and Rust, I can share a single cache across all operations. That's what makes the difference.

lazy_static! {
    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn get_writer_metadata(path: &str) -> Result<Metadata> {
    let mut cache = WRITER_CACHE.lock().unwrap();

    if let Some(cached) = cache.get(path) {
        return Ok(cached.metadata.clone());  // cache hit - no filesystem access
    }

    let writer_cache = WriterCache::new(path.into())?;
    let metadata = writer_cache.metadata.clone();
    cache.insert(path.to_string(), writer_cache);
    Ok(metadata)
}

A few things worth noting: lazy_static! creates the cache exactly once; Mutex<HashMap> is simpler than RwLock and sufficient for my access patterns; metadata is small (~1KB), so cloning it on each hit is cheap enough that I didn't reach for reference counting.
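
If you would rather not depend on lazy_static, the same pattern can be sketched with the standard library alone. This is a minimal sketch, not Vine's actual code: the Metadata struct, the get_or_load name, and the loader closure are stand-ins invented for illustration, and std::sync::OnceLock requires Rust 1.70 or newer.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Hypothetical stand-in for the post's `Metadata` type.
#[derive(Clone, Debug, PartialEq)]
pub struct Metadata {
    pub table_name: String,
    pub field_names: Vec<String>,
}

/// Process-wide cache, created on first use (std-only alternative to lazy_static!).
fn cache() -> &'static Mutex<HashMap<String, Metadata>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Metadata>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the cached metadata for `path`, calling `load` only on a miss.
pub fn get_or_load<F>(path: &str, load: F) -> Result<Metadata, String>
where
    F: FnOnce(&str) -> Result<Metadata, String>,
{
    // Recover the guard even if another thread panicked while holding the lock.
    let mut guard = cache().lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    if let Some(hit) = guard.get(path) {
        return Ok(hit.clone()); // cache hit: the loader is never called
    }
    let loaded = load(path)?; // cache miss: run the loader (e.g. read from disk)
    guard.insert(path.to_string(), loaded.clone());
    Ok(loaded)
}
```

Note that the lock is held while the loader runs, so first-time loads for different tables are serialized. For a handful of tables that is probably fine; with many tables, loading outside the lock would be worth considering.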

The writer path then becomes:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?;  // in-memory lookup
    if self.config.validate_every_write {
        validate_schema_match(rows, &metadata)?;
    }
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

pub struct WriterConfig {
    pub enable_metadata_cache: bool,    // Default: true
    pub require_metadata_file: bool,    // Default: true (strict mode)
    pub validate_every_write: bool,     // Default: false (for performance)
}

I validate once on the first write, then trust the cache after that. For streaming workloads where schemas are stable, per-write validation isn't worth the cost.
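
What the validate_every_write flag changes can be sketched in a few lines. This is a hypothetical, simplified Writer: the column-count check stands in for real schema validation, the file write is left out, and none of these names are taken from Vine.

```rust
/// Hypothetical writer that validates on its first batch only,
/// unless `validate_every_write` is set.
pub struct Writer {
    validate_every_write: bool,
    validated_once: bool,
    /// Number of times validation actually ran (exposed for illustration).
    pub validations_run: usize,
}

impl Writer {
    pub fn new(validate_every_write: bool) -> Self {
        Self { validate_every_write, validated_once: false, validations_run: 0 }
    }

    /// Validates the batch when required, then (in a real writer) writes it.
    pub fn write_batch(&mut self, rows: &[&str], expected_columns: usize) -> Result<(), String> {
        if self.validate_every_write || !self.validated_once {
            self.validations_run += 1;
            for row in rows {
                let got = row.split(',').count();
                if got != expected_columns {
                    return Err(format!("expected {expected_columns} columns, got {got}"));
                }
            }
            // Only a batch that passed counts as the one-time validation.
            self.validated_once = true;
        }
        // The actual file write would happen here.
        Ok(())
    }
}
```

With the default (false), a malformed batch after the first one passes through unchecked; with the flag set, it is rejected on the spot. That is the cost of trusting the cache.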

Implementation: schema-on-read fallback

For readers, I need more flexibility. A missing metadata file shouldn't crash the read path.

pub fn new_with_fallback(base_path: PathBuf) -> Result<Self, Error> {
    let meta_path = base_path.join("vine_meta.json");

    // Strategy 1: vine_meta.json (explicit schema - fastest)
    if meta_path.exists() {
        return Self::new(base_path);
    }

    let cache_path = base_path.join("_meta/schema.json");

    // Strategy 2: Cached schema (disk cache - fast)
    if cache_path.exists() {
        if let Ok(metadata) = Metadata::load(&cache_path) {
            return Ok(Self { metadata, base_path });
        }
    }

    // Strategy 3: Infer from Vortex files (slowest but always works)
    let metadata = Metadata::infer_from_vortex(&base_path)?;

    // Cache asynchronously for next read
    let (metadata_clone, cache_path_clone) = (metadata.clone(), cache_path.clone());
    std::thread::spawn(move || { let _ = metadata_clone.save_to_cache(&cache_path_clone); });

    Ok(Self { metadata, base_path })
}

That async cache write mattered more than expected. Waiting for the disk write was adding unnecessary latency to the first read. A background thread handles it while I return results immediately, and if it fails, I just re-infer on the next miss.

Schema inference reads only file headers, not full data rows, and is meant to stop early once a complete schema is found. The simplified listing below leaves that check out and merges every file it sees:

pub fn infer_from_vortex<P: AsRef<Path>>(base_path: P) -> Result<Metadata> {
    let mut all_schemas = Vec::new();
    for date_dir in find_date_directories(&base_path)? {
        for vortex_file in find_vortex_files(&date_dir)? {
            let (dtype, _) = read_vortex_file(&vortex_file)?;
            all_schemas.push(dtype_to_metadata(&dtype, "inferred_table"));
        }
    }
    merge_schemas(all_schemas)  // union of all fields seen
}
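
The listing above merges every file it finds. One way an early stop could look is sketched below, under the assumption that the set of expected field names is known up front. Schemas are reduced to lists of field names, and merge_until_complete is a hypothetical helper, not a function from Vine.

```rust
use std::collections::BTreeSet;

/// Unions field names from per-file schemas and stops as soon as every
/// expected field has been seen. Returns the merged set together with the
/// number of schemas that were actually consumed.
/// With an empty `expected` list the loop stops after the first schema.
pub fn merge_until_complete<I>(schemas: I, expected: &[&str]) -> (BTreeSet<String>, usize)
where
    I: IntoIterator<Item = Vec<String>>,
{
    let mut merged = BTreeSet::new();
    let mut files_read = 0;
    for fields in schemas {
        files_read += 1;
        merged.extend(fields);
        // Stop once nothing expected is missing; later files are never opened.
        if expected.iter().all(|name| merged.contains(*name)) {
            break;
        }
    }
    (merged, files_read)
}
```

Because the input is an iterator, passing a lazy iterator that opens one file header per step means the remaining files are never touched after the break.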

Implementation: handling schema mismatches

Readers must handle the case where the expected schema (from metadata) doesn't match the actual schema (from data files). Instead of failing, I use lenient matching: iterate over expected fields, extract value if present, fill with a type-appropriate default if not.

pub fn array_to_csv_rows_lenient(
    array: &StructArray,
    expected_schema: &Metadata
) -> Result<Vec<String>> {
    let mut rows = Vec::new();
    for row_idx in 0..array.len() {
        let values: Vec<String> = expected_schema.fields.iter().map(|field| {
            match array.field_by_name(&field.name) {
                Some(col) => extract_value(col, row_idx),
                None      => default_value_for_type(&field.data_type),
            }
        }).collect();
        rows.push(values.join(","));
    }
    Ok(rows)
}

Old readers skip unknown fields; new readers fill missing ones with type-appropriate defaults. Both directions work without coordination.

Operational impact

The fallback chain has some useful operational properties worth calling out.

Schema changes don't require coordination: a writer creates new files with an updated schema, readers pick up new fields automatically via inference, and there is no registry to update and no downtime. If vine_meta.json is accidentally deleted, reads still work (falling back to inference) while writes fail fast, and the metadata can be rebuilt from data files with vine schema rebuild <table_path>. That's a better failure mode than formats that require metadata for reads (e.g., the Delta Lake transaction log).

Validation overhead also stays bounded: schema is validated once on Writer::new(), then cached. All subsequent writes in that writer's lifetime skip validation entirely.

What I've learned

1. Cache at the right granularity

I first tried per-writer instance caching. It helped within a single writer but not enough: Spark creates many short-lived writers (one per partition), and each paid the full cold-start cost.

Moving the cache to global scope meant it survived the writer lifecycle. The speedup came from paying the I/O cost once and spreading it across all subsequent operations, not just within a single instance.

2. Optimize for the common case

Validating the schema on every write adds overhead per operation. Validating once adds that overhead once. Since schemas in streaming workloads almost never change between writes, the right default is: validate once, trust the cache.

3. Separate write and read semantics

Writers are strict: they require vine_meta.json at table creation, validate on the first write, and fail fast on mismatches. Readers are lenient: they try the metadata file first, fall back to the disk cache, fall back to file inference, and handle missing fields gracefully. The same semantics don't work for both access patterns.

4. Use file formats as schema sources

Vortex files (like Parquet) are self-describing: the schema lives in file headers. That means no external schema registry to operate, and you can't lose the schema without losing the data. The files themselves are the source of truth.

Comparison: why global cache matters

Two approaches side by side:

Without caching, every write hits the filesystem:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = Metadata::load("vine_meta.json")?;  // hit disk EVERY TIME
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

With global caching, every write does an in-memory lookup shared across all tasks:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?;  // HashMap lookup
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

The cache survives the writer instance lifecycle. The first write across any writer pays the I/O cost once; everything after is a HashMap lookup.

Per-instance caching (caching inside a Writer struct) is the obvious middle ground, but Spark creates many short-lived writers (one per partition)—each new instance would still pay the cold-start cost.

Trade-offs and limitations

What I gave up

Strong schema enforcement: I validate once, not on every write. Schema errors may not surface immediately. There's a configurable validate_every_write flag for stricter environments.

Instant schema change detection: writers may briefly use a stale cached schema after a metadata update. File-watching cache invalidation is planned.

What I gained

Metadata lookup goes from a filesystem operation to a HashMap lookup on the hot path. No external schema registry means one fewer system to operate. Reads still work if the metadata file goes missing, fallback chains rebuild automatically, and there's no single point of failure.

What's next

A few things I'm planning:

Cache invalidation: Watch vine_meta.json with notify and invalidate the global cache on file change. Right now a schema update requires a process restart to take effect.
TTL-based expiration: For long-running processes, stale cache is a real risk. A configurable TTL (default: 1 hour) with background refresh should cover most cases.
Cache statistics: Hit rate and miss count exposed as metrics for monitoring. Cache effectiveness is currently invisible.
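
As a rough idea of what the TTL piece could look like, here is a minimal sketch. It is an assumption about a feature that does not exist yet: TtlCache and get_or_reload are invented names, and it reloads lazily on the next lookup rather than refreshing in the background as planned above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cached value plus the moment it was loaded.
struct Entry<V> {
    value: V,
    loaded_at: Instant,
}

/// Minimal TTL cache: entries at least `ttl` old are treated as misses.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, Entry<V>>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns a fresh cached value, or calls `load` when the entry is
    /// missing or expired. `now` is a parameter so expiry can be exercised
    /// without sleeping.
    pub fn get_or_reload<F>(&mut self, key: &str, now: Instant, load: F) -> V
    where
        F: FnOnce() -> V,
    {
        if let Some(entry) = self.entries.get(key) {
            if now.duration_since(entry.loaded_at) < self.ttl {
                return entry.value.clone();
            }
        }
        let value = load();
        self.entries.insert(
            key.to_string(),
            Entry { value: value.clone(), loaded_at: now },
        );
        value
    }
}
```

In the global cache this would sit behind the same Mutex; the only change on the hot path is one Instant comparison per lookup.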

I hope to have a chance to write about these.

Conclusion

The underlying issue was that metadata doesn't change between writes. Every disk load was wasted I/O, not because the file was large, but because syscall overhead and deserialization cost compound at high write rates. Cache globally, validate once, and let the file format carry the schema when the metadata file isn't there.

Code repository

Full implementation: Vine GitHub Repository

vine-core/src/global_cache.rs — global cache implementation
vine-core/src/reader_cache.rs — schema-on-read fallback chain
vine-core/src/metadata.rs — metadata inference from Vortex files

References

Measurements of system call performance and overhead — Linux syscall latency benchmarks
Factors affecting I/O and file system performance — Red Hat Enterprise Linux documentation
Scaling file system metadata performance — CMU Parallel Data Lab research paper
Iceberg vs Delta Lake metadata indexing — Metadata comparison between Apache Iceberg and Delta Lake
Metadata caching for BigQuery external tables — Google Cloud documentation
Accelerate metadata-heavy workloads with metadata caching — Microsoft Azure Storage Blog

This is the next chapter of the following post: https://dev.to/kination/build-my-own-datalake-part-1-367h

The Problem: Metadata is the Hidden Bottleneck

In high-throughput streaming pipelines, every millisecond counts. We discovered that our initial implementation of Vine (a write-optimized data lake format) was spending 300ms per write just reading a tiny JSON metadata file.

For a system designed to handle 1000+ writes per second, this was unacceptable.

The Initial Implementation

Like many data lake formats, Vine uses a metadata file to define table schemas:

{
  "table_name": "user_events",
  "fields": [
    {
      "id": 1,
      "name": "user_id",
      "data_type": "long",
      "is_required": true
    },
    {
      "id": 2,
      "name": "event_type",
      "data_type": "string",
      "is_required": true
    }
  ]
}

Here's how this metadata drives Parquet read/write operations in a data lake pattern:

use serde_json;
use std::fs;
use std::path::Path;
use parquet::file::reader::FileReader;
use parquet::file::writer::FileWriter;

// 1. Load metadata from vine_meta.json
fn load_metadata(base_path: &str) -> Result<Metadata> {
    let meta_path = Path::new(base_path).join("vine_meta.json");
    let content = fs::read_to_string(meta_path)?;
    let metadata: Metadata = serde_json::from_str(&content)?;
    Ok(metadata)
}

// 2. Read Parquet files using the schema from metadata
fn read_user_events(base_path: &str) -> Result<Vec<UserEvent>> {
    let metadata = load_metadata(base_path)?;  // 80ms - loads schema

    // Scan date-partitioned directories (2024-01-24/, 2024-01-25/, ...)
    let mut all_events = Vec::new();
    for date_dir in find_date_directories(base_path)? {
        for parquet_file in find_parquet_files(&date_dir)? {
            // Read Parquet file with schema validation
            let file = fs::File::open(parquet_file)?;
            let reader = parquet::file::reader::SerializedFileReader::new(file)?;

            // Validate schema matches metadata
            validate_schema(&reader.metadata().file_metadata().schema(), &metadata)?;

            // Read rows
            for row in reader.get_row_iter(None)? {
                let event = parse_row(row, &metadata)?;
                all_events.push(event);
            }
        }
    }
    Ok(all_events)
}

// 3. Write Parquet files with date partitioning (data lake pattern)
fn write_user_events(base_path: &str, events: Vec<UserEvent>) -> Result<()> {
    let metadata = load_metadata(base_path)?;  // 80ms - loads schema

    // Create date-partitioned output path
    let today = chrono::Utc::now().format("%Y-%m-%d").to_string();
    let date_dir = Path::new(base_path).join(&today);
    fs::create_dir_all(&date_dir)?;

    // Generate filename with microsecond precision
    let timestamp = chrono::Utc::now().format("%H%M%S_%f").to_string();
    let output_file = date_dir.join(format!("data_{}.parquet", timestamp));

    // Convert metadata to Parquet schema
    let parquet_schema = metadata_to_parquet_schema(&metadata);

    // Write Parquet file
    let file = fs::File::create(output_file)?;
    let mut writer = parquet::file::writer::SerializedFileWriter::new(
        file,
        parquet_schema.clone(),
        Default::default()
    )?;

    // Write rows using schema from metadata
    for event in events {
        let row = event_to_parquet_row(&event, &metadata)?;
        writer.write(row)?;
    }

    writer.close()?;
    Ok(())
}

The naive approach: Read vine_meta.json on every write operation.

This was a classic case of redundant I/O. We were doing disk reads for data that never changed.

The Solution: Three-Tier Caching Strategy

We implemented an aggressive caching strategy that caches metadata at three levels:

┌─────────────────────────────────────────┐
│ Layer 1: Global In-Memory Cache         │
│  - lazy_static + Mutex<HashMap>         │
│  - Shared across ALL JNI calls          │
│  - Lifetime: Process lifetime           │
│  - Lookup time: 0.88ms                  │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 2: Local Disk Cache               │
│  - _meta/schema.json per table          │
│  - Updated asynchronously               │
│  - Lookup time: 10-20ms                 │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 3: Vortex File Inference          │
│  - Read schema from data files          │
│  - Merge if multiple versions           │
│  - Lookup time: 50-80ms                 │
│  - Always works (schema-on-read)        │
└─────────────────────────────────────────┘

Why Three Layers?

Layer 1 (Global Cache): The fast path for 99.9% of operations. Since schemas rarely change, we cache them in memory for the lifetime of the process.

Layer 2 (Disk Cache): Enables fast cold-start recovery. When the process restarts, we don't need to re-read the original metadata file or infer from data files.

Layer 3 (File Inference): The ultimate fallback. Even if vine_meta.json is missing or corrupted, we can always infer the schema from Vortex data files themselves.

Implementation: Global Cache Layer

The key breakthrough was realizing that at the JNI boundary between Spark and Rust, we could share a global cache across all operations.

Global Cache Implementation

use lazy_static::lazy_static;
use std::collections::HashMap;
use std::sync::Mutex;

lazy_static! {
    static ref READER_CACHE: Mutex<HashMap<String, ReaderCache>> =
        Mutex::new(HashMap::new());

    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn get_writer_metadata(path: &str) -> Result<Metadata> {
    let mut cache = WRITER_CACHE.lock().unwrap();

    // Fast path: Check global cache first
    if let Some(cached) = cache.get(path) {
        return Ok(cached.metadata.clone());  // 0.88ms - cache hit!
    }

    // Slow path: Load from disk and cache
    let writer_cache = WriterCache::new(path.into())?;
    let metadata = writer_cache.metadata.clone();
    cache.insert(path.to_string(), writer_cache);

    Ok(metadata)
}

Key insights:

Lazy initialization: Use lazy_static! to create a global cache that's initialized once
Simple locking: Use a single Mutex<HashMap> instead of RwLock (simpler, sufficient for our access patterns)
Clone is cheap: Metadata is small (~1KB), so cloning it on each hit is cheap enough that we skipped reference counting

Writer Path with Global Cache

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    // Fast path: Use cached metadata (97x faster!)
    let metadata = self.cached_metadata.as_ref()
        .ok_or("Metadata cache not initialized")?;

    // Optional: Validate schema (disabled by default for performance)
    if self.config.validate_every_write {
        validate_schema_match(rows, metadata)?;
    }

    // Write to Vortex with schema
    write_vortex_file(&self.path, metadata, rows)?;
    Ok(())
}

Configuration options:

pub struct WriterConfig {
    pub enable_metadata_cache: bool,    // Default: true
    pub require_metadata_file: bool,    // Default: true (strict mode)
    pub validate_every_write: bool,     // Default: false (for performance)
}

By default, we validate once (on cache miss) and then trust the cache for all subsequent writes. This trades off strict validation for performance—a worthwhile trade for streaming workloads where schemas are stable.

Implementation: Schema-on-Read Fallback

For readers, we need more flexibility. A missing metadata file shouldn't crash the read path.

Fallback Chain Implementation

pub fn new_with_fallback(base_path: PathBuf) -> Result<Self, Error> {
    let meta_path = base_path.join("vine_meta.json");

    // Strategy 1: vine_meta.json (explicit schema - fastest)
    if meta_path.exists() {
        return Self::new(base_path);  // Load from JSON (~10ms)
    }

    let cache_path = base_path.join("_meta/schema.json");

    // Strategy 2: Cached schema (disk cache - fast)
    if cache_path.exists() {
        if let Ok(metadata) = Metadata::load(&cache_path) {
            return Ok(Self {
                metadata,
                base_path,
            });
        }
    }

    // Strategy 3: Infer from Vortex files (slowest but always works)
    let metadata = Metadata::infer_from_vortex(&base_path)?;

    // Cache asynchronously for next read
    let metadata_clone = metadata.clone();
    let cache_path_clone = cache_path.clone();
    std::thread::spawn(move || {
        let _ = metadata_clone.save_to_cache(&cache_path_clone);
    });

    Ok(Self {
        metadata,
        base_path,
    })
}

Why asynchronous caching?

We discovered that waiting for the cache write (5-10ms) was adding unnecessary latency to the first read. By spawning a background thread, we:

Return results to the user immediately
Cache the schema for the next operation
Avoid blocking on disk I/O

This is safe because:

Cache writes are idempotent
Cache is only a performance optimization, not required for correctness
Worst case: We re-infer schema on next cache miss

Schema Inference from Vortex Files

When all else fails, we can always read the schema directly from Vortex data files:

pub fn infer_from_vortex<P: AsRef<Path>>(base_path: P) -> Result<Metadata> {
    let mut all_schemas = Vec::new();

    // Scan date-partitioned directories (YYYY-MM-DD/)
    for date_dir in find_date_directories(&base_path)? {
        for vortex_file in find_vortex_files(&date_dir)? {
            // Read Vortex file header (cheap - no full scan needed)
            let (dtype, _) = read_vortex_file(&vortex_file)?;
            let schema = dtype_to_metadata(&dtype, "inferred_table");
            all_schemas.push(schema);
        }
    }

    // Merge all schemas (union of fields)
    let merged = merge_schemas(all_schemas)?;
    Ok(merged)
}

Performance characteristics:

Reading Vortex headers: ~1-2ms per file
Schema merging: ~5ms for 100 files
Total: 50-80ms (still faster than cold disk reads on many filesystems)

Optimization: We only scan until we find a complete schema. If the first file has all expected fields, we stop early.

Implementation: Handling Schema Mismatches

Readers must handle the case where the expected schema (from metadata) doesn't match the actual schema (from data files).

Lenient Schema Matching

pub fn array_to_csv_rows_lenient(
    array: &StructArray,
    expected_schema: &Metadata
) -> Result<Vec<String>> {
    let mut rows = Vec::new();

    for row_idx in 0..array.len() {
        let mut values = Vec::new();

        for expected_field in &expected_schema.fields {
            if let Some(column) = array.field_by_name(&expected_field.name) {
                // Field exists in data: Extract value
                let value = extract_value(column, row_idx);
                values.push(value);
            } else {
                // Field missing: Use default/null
                values.push(default_value_for_type(&expected_field.data_type));
            }
        }

        rows.push(values.join(","));
    }

    Ok(rows)
}

fn default_value_for_type(data_type: &str) -> String {
    match data_type {
        "integer" | "long" | "byte" | "short" => "0".to_string(),
        "float" | "double" => "0.0".to_string(),
        "boolean" => "false".to_string(),
        "string" => "".to_string(),  // Empty string for missing text
        _ => "".to_string(),
    }
}

This provides:

Forward compatibility: Old readers can read new data (ignore unknown fields)
Backward compatibility: New readers can read old data (fill missing fields with defaults)

Performance Results

Benchmark Setup

Dataset: 1M rows, 4 columns (id: int, name: string, age: int, score: double)
Hardware: M1 Mac, 16GB RAM
Measurement: 100 repeated operations

Write Performance

Configuration	Time (1M rows)	Throughput	Speedup
No cache (baseline)	12.0s	83K rows/sec	1x
With global cache	1.8s	555K rows/sec	6.7x

Breakdown of 12.0s baseline:

JSON parsing: 8.5s (71%)
CSV conversion: 2.0s (17%)
Vortex write: 1.5s (12%)

Breakdown of 1.8s cached:

Metadata lookup: 0.088s (5%)
CSV conversion: 0.7s (39%)
Vortex write: 1.0s (56%)

The cache eliminated 8.4 seconds of pure JSON parsing overhead!

Read Performance

Configuration	Time (100 calls)	Per-call	Speedup
No cache (baseline)	8000ms	80ms/call	1x
With global cache	88ms	0.88ms/call	91x

Per-call breakdown (no cache):

File open: 15ms
JSON parse: 50ms
Metadata object creation: 15ms

Per-call breakdown (cached):

HashMap lookup: 0.3ms
Clone: 0.58ms

Memory Overhead

Metric	Value
Metadata size	~1KB per table
Cache overhead	<1MB for 1000 tables
Memory amplification	Negligible (<0.1% of heap)

The cache is essentially free in terms of memory.

Lessons Learned

1. Cache at the Right Granularity

Initial attempt: Per-writer instance caching

Helped: Reduced redundant reads within a single writer
Didn't help: JNI overhead still present for each writer creation

Breakthrough: Global cache shared across all JNI calls

Result: 97x speedup because cache survives writer lifecycle

Key insight: In high-throughput systems, amortize I/O across all operations, not just within a single instance.

2. Optimize for the Common Case

Observation: 99.9% of writes use the same schema.

Initial design: Validate schema on every write

Cost: 5-10ms per write
Benefit: Catch schema errors immediately

Optimized design: Validate once, trust cache

Cost: 5-10ms once (on cache miss)
Benefit: 0ms for all subsequent writes

Key insight: Assume schemas are stable, handle evolution as the exception.

3. Separate Write and Read Semantics

Writers (strict mode):

Require vine_meta.json at table creation
Validate schema on first write
Fail fast on mismatches

Readers (lenient mode):

Try vine_meta.json first
Fallback to cache
Fallback to file inference
Handle missing fields gracefully

Key insight: Different access patterns need different guarantees. Don't force one-size-fits-all.
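The lenient reader order (metadata file, then cache, then file inference) maps naturally onto chained `Option::or_else` calls. This is a generic sketch; the closures stand in for the real loaders, and `SchemaSource` is a name invented for the example.

```rust
/// Which tier of the fallback chain produced the schema.
#[derive(Debug, Clone, PartialEq)]
pub enum SchemaSource {
    MetaFile,
    Cache,
    Inferred,
}

/// Try each source in order and stop at the first one that yields a schema.
/// Later closures are not called once an earlier one succeeds.
pub fn resolve_schema<T>(
    from_meta_file: impl FnOnce() -> Option<T>,
    from_cache: impl FnOnce() -> Option<T>,
    infer_from_files: impl FnOnce() -> Option<T>,
) -> Option<(T, SchemaSource)> {
    from_meta_file()
        .map(|s| (s, SchemaSource::MetaFile))
        .or_else(|| from_cache().map(|s| (s, SchemaSource::Cache)))
        .or_else(|| infer_from_files().map(|s| (s, SchemaSource::Inferred)))
}
```

Returning the source alongside the schema makes it easy to log how often readers fall through to the slower tiers.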

4. Use File Formats as Schema Sources

Vortex (like Parquet) files are self-describing. The schema is embedded in the file header.

Implication: We don't need an external schema registry. The data files themselves are the source of truth.

Benefit:

Resilience (can't lose schema if you have the data)
Simplicity (one less system to operate)
Performance (schema is co-located with data)

Key insight: Modern columnar formats are self-describing. Trust the format, don't duplicate metadata.

Operational Impact

Zero-Downtime Schema Changes

Because readers use a fallback chain, we can update schemas without coordination:

Writer creates new files with updated schema
Readers detect new fields automatically via inference
No global schema registry to update
No downtime required

Resilient to Metadata Loss

If vine_meta.json is accidentally deleted:

Reads still work (fallback to inference)
Writes fail (strict mode requirement)
Rebuild metadata from data files: vine schema rebuild <table_path>

This is much better than formats that require metadata for reads (e.g., Delta Lake transaction log).

Write-Optimized Validation

Schema validation happens once per writer instance, not once per write:

// First write: Validate schema
Writer::new(path) -> Loads metadata (10ms), validates, caches

// Subsequent writes: Trust cache
writer.write(batch) -> Uses cached metadata (0.88ms)

Validation overhead: <1% of total write time (only on first write).

Comparison: Why Global Cache Matters

Let's compare different caching strategies:

No Caching (Baseline)

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    // Re-reads and re-parses vine_meta.json on every call
    let metadata = Metadata::load("vine_meta.json")?;  // 80ms EVERY TIME
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

Performance: 300ms per write

Per-Instance Caching

pub struct Writer {
    path: String,  // table location, as used via self.path in the global-cache version
    cached_metadata: Option<Metadata>,
}

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    // Load once per writer instance, then reuse
    if self.cached_metadata.is_none() {
        self.cached_metadata = Some(Metadata::load("vine_meta.json")?);
    }
    let metadata = self.cached_metadata.as_ref().unwrap();
    write_vortex_file(&self.path, metadata, rows)?;
    Ok(())
}

Performance:

First write: 300ms
Subsequent writes: 3ms
Problem: Every new writer instance pays 300ms cost

In Spark, we create many short-lived writer instances (one per partition). Per-instance caching helped, but not enough.

Global Caching (Our Approach)

lazy_static! {
    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    // Borrow the path; passing self.path by value would move it out of self
    let metadata = get_writer_metadata(&self.path)?;  // 0.88ms from global cache
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

Performance:

First write (any writer): 300ms (loads and caches)
All subsequent writes (any writer): 3ms
Win: Cache survives writer instance lifecycle

Speedup: 97x improvement for steady-state writes.
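For readers who want to try the pattern without extra crates, here is a self-contained sketch of a process-wide cache built on `std::sync::OnceLock` instead of `lazy_static!`. The `Metadata` struct and `get_or_load` are simplified stand-ins, not Vine's real types, and handing out `Arc<Metadata>` rather than a clone is a design choice of this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Stand-in for the article's Metadata type (assumption: only field names).
#[derive(Debug)]
pub struct Metadata {
    pub fields: Vec<String>,
}

/// One cache for the whole process, created lazily on first use.
static GLOBAL_CACHE: OnceLock<Mutex<HashMap<String, Arc<Metadata>>>> = OnceLock::new();

/// Return the cached metadata for `table_path`, calling `load` only on a miss.
pub fn get_or_load<F>(table_path: &str, load: F) -> Arc<Metadata>
where
    F: FnOnce() -> Metadata,
{
    let cache = GLOBAL_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    if let Some(hit) = guard.get(table_path) {
        return Arc::clone(hit); // Cache hit: a cheap reference-count bump
    }
    // Cache miss: load while holding the lock so concurrent callers for the
    // same table do not all pay the load cost (simple, but serializes misses).
    let loaded = Arc::new(load());
    guard.insert(table_path.to_string(), Arc::clone(&loaded));
    loaded
}
```

Because the map lives in a `static`, it outlives any individual writer, which is the property the comparison above depends on.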

Trade-offs and Limitations

What We Gave Up

1. Strong Schema Enforcement

Pro: Fast writes
Con: May not catch schema errors immediately
Mitigation: Optional per-write validation (configurable)

2. Instant Schema Change Detection

Pro: Zero coordination overhead
Con: Writers may use stale cached schema
Mitigation: Cache invalidation on metadata file update (planned)

3. Fine-Grained Version Control

Pro: Simple implementation
Con: No explicit schema history tracking
Mitigation: Version log (planned for v0.4.0)

What We Gained

1. Write Latency

97x faster metadata access
6.7x end-to-end write throughput

2. Operational Simplicity

No external schema registry to operate
Self-describing data files
Automatic fallback chains

3. Resilience

Reads work even if metadata is missing
Automatic cache rebuilding
No single point of failure

Future Optimizations

Short-Term (v0.3.0)

Cache Invalidation:

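// Watch vine_meta.json for changes (sketch against the notify 4.x API)
use notify::{RecursiveMode, Watcher};

pub fn watch_metadata_file(path: &str) -> Result<()> {
    let (tx, rx) = std::sync::mpsc::channel();
    // The watcher must stay alive for events to keep arriving
    let mut watcher = notify::watcher(tx, Duration::from_secs(1))?;
    watcher.watch(path, RecursiveMode::NonRecursive)?;

    // Invalidate cache on every file change.
    // This loop blocks, so run it on a dedicated thread.
    while let Ok(_event) = rx.recv() {
        invalidate_cache(path);
    }
    Ok(())
}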

Cache Warming:

// Pre-load frequently used tables into cache
pub fn warm_cache(table_paths: &[&str]) -> Result<()> {
    for path in table_paths {
        let _ = get_reader_metadata(path)?;  // Load into global cache
    }
    Ok(())
}

Medium-Term (v0.4.0)

TTL-Based Cache Expiration:

pub struct CachedMetadata {
    metadata: Metadata,
    loaded_at: SystemTime,
    ttl: Duration,  // Default: 1 hour
}

impl CachedMetadata {
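    pub fn is_expired(&self) -> bool {
        // elapsed() fails if the system clock moved backwards;
        // treat that as expired instead of panicking.
        self.loaded_at
            .elapsed()
            .map(|age| age > self.ttl)
            .unwrap_or(true)
    }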
}

Cache Statistics:

pub struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
    evictions: AtomicU64,
}

// Expose metrics for monitoring
pub fn get_cache_hit_rate() -> f64 {
    let hits = CACHE_STATS.hits.load(Ordering::Relaxed);
    let misses = CACHE_STATS.misses.load(Ordering::Relaxed);
    let total = hits + misses;
    if total == 0 {
        return 0.0;  // No lookups yet; avoid 0/0 = NaN
    }
    hits as f64 / total as f64
}

Conclusion

By implementing a three-tier caching strategy with a global cache at the JNI boundary, we achieved:

91x faster metadata lookups (80ms → 0.88ms)
6.7x faster end-to-end writes (12s → 1.8s for 1M rows)
Zero operational overhead (no external systems required)
Resilient fallbacks (schema-on-read always works)

The key lessons:

Cache at the right level: Global > Per-instance > No cache
Optimize for the common case: Schemas rarely change
Separate read/write semantics: Strict writes, lenient reads
Trust file formats: Self-describing data > external schemas

For write-optimized data lakes, metadata access should be invisible. Every millisecond spent on schema lookups is a millisecond stolen from actual data processing.

Code Repository

Full implementation available at: Vine GitHub Repository

Key files:

vine-core/src/global_cache.rs - Global cache implementation
vine-core/src/reader_cache.rs - Schema-on-read fallback chain
vine-core/src/metadata.rs - Metadata inference from Vortex files

Fundamental matters more in AI era

kination — Sat, 14 Feb 2026 12:36:03 +0000

Large Language Models (LLMs) have become an inseparable part of the modern computing landscape. Software development is no exception. We’ve moved past the stage where AI only generates simple boilerplate; today, LLMs are capable of implementing complex logic and architecting entire applications.

Naturally, this raises a pressing question for those of us in the industry: Is there a future for software developers, or are we being phased out?

Shift is Real

Some might argue that "no-code" movements have always tried to replace developers, only to fail because a professional is always eventually needed. However, the current shift feels different. This is the most significant paradigm shift I’ve witnessed since I started earning a living through code.

Ignoring LLMs in favor of "pure" manual coding will soon lead to a dead end. But there is a flip side: if you simply rely on LLMs to write code while acting as a mere "code checker," you become easily replaceable.

Two Paths Forward: Strategy in the Age of Autopilot

To stay relevant, developers must choose distinct paths. The middle ground is rapidly disappearing.

The Architect of Experience: Use AI as a force multiplier to solve human problems. Your value lies in how quickly you can integrate APIs and LLMs to build trendy, user-centric services.

The Architect of Systems (The Fundamentalist): Focus on the "under-the-hood" mechanics. As LLMs flood the world with abstraction, we need engineers who understand why a system fails under high concurrency or why a memory management strategy is suboptimal.

I recently came across a blog post that perfectly encapsulates these sentiments. To be honest, it is what motivated this article.

https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html

Don't Just Accept -> Validate

The greatest trap of the LLM era is complacency. We must not settle for what the LLM hands us. Think of it this way: the time you've saved by not having to manually search through documentation should be reinvested into verifying and understanding the generated logic. When you combine your hard-earned experience with the efficiency of an LLM, your competitiveness doesn't just increase—it multiplies. You aren't just a consumer of AI; you are its auditor and architect.

Opening the Black Box: Back to the Metal

The "Black Box" problem is a silent threat. In the data world, we stand on the shoulders of giants like Kubernetes, Spark, Flink, and Airflow. Yet, very few engineers understand these tools beyond their documentation.

We must remember a fundamental truth: No matter how sophisticated the AI or how complex the technology, everything eventually runs on CPU, Memory, and Network. When a critical issue occurs in production, the root cause is almost always found within these three pillars. Engineers who can bridge the gap between high-level AI abstractions and these fundamental hardware constraints will never be out of demand.

The "Build Your Own X" Philosophy

This is exactly why I’ve been focusing on "build-your-own-x" projects. The goal is to bridge the gap between "using" a tool and "understanding" its core principles.

Interestingly, you don't need to go back to school to do this. Your LLM is the ultimate mentor. A modern LLM can drastically reduce the time it takes to grasp complex internal architectures. It shouldn't just write your code; it should explain the "why" behind the logic, acting as a teacher that uncovers what you didn't even know you were missing.

Stay Curious

While headlines talk about developer layoffs, there is still an immense demand for engineers who possess deep technical intuition and an insatiable curiosity.

The AI era doesn't mean we study less; it means we must study deeper. Even if LLMs handle the bulk of the coding, services built for humans will always require people who understand the soul of the machine. Don't let the convenience of AI stifle your technical curiosity—use that convenience to fuel it.

Reference

https://notes.eatonphil.com/2026-01-19-llms-and-your-career.html

Story of 'smoodit' (1) : Electron to Tauri

kination — Sat, 24 Jan 2026 08:39:15 +0000

This article documents the story behind the development of a text editor named smoodit, and the lessons learned along the way.

Plan, and starting

My primary objective was to build a privacy-focused desktop text editor with a built-in assistant that boosts editing efficiency through predictive text that anticipates the user's needs.

For the PoC, I started with Electron, which is close to the industry standard. I initially thought about calling external LLM APIs (ChatGPT, Gemini, Claude...), but after considering offline mode and cost management, I decided to go with an embedded model. After evaluating a few options, I settled on "Ollama" because of how easily it could be integrated into the workflow.

Why I decided to migrate from Electron to Tauri

Electron has been the go-to for desktop apps for years, but its massive resource footprint (thanks to Chromium) started weighing down my project. My application needed to run a Python backend and a local LLM engine (Ollama) simultaneously. Bundling an LLM significantly increased the application's footprint, depending on the model used. This made me much more conscious of the bundle size, which eventually became one of the primary catalysts for my decision to migrate.

For a project where performance and system agility are paramount, Tauri v2 emerged as the clear next step. The new stack:

Frontend: React + Vite (leveraging Tauri v2 APIs).
Backend (Sidecar): A FastAPI server packaged into a single binary using PyInstaller.
AI Engine (Sidecar): A raw Ollama binary serving local LLMs.

The Migration Roadmap

Phase 1: Mastering the Tauri Sidecar

One of Tauri's powerful features is the "Sidecar": the ability to bundle and execute external binaries alongside the application core.

Packaging Python: I used PyInstaller to freeze the Python app into a standalone executable that runs independently.
Configuration: I registered both the backend_server and ollama binaries in src-tauri/tauri.conf.json under externalBin.
Naming: In Tauri v2, sidecar binaries must include the target triple suffix (e.g., -aarch64-apple-darwin) in their filenames to be correctly identified at runtime.
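The naming rule is easy to get wrong in build scripts, so here is a small sketch of it. `sidecar_file_name` is a hypothetical helper, not part of Tauri; it assumes the convention that the triple is appended after a hyphen and that Windows targets keep the .exe extension at the very end.

```rust
/// Build the on-disk file name Tauri expects for a sidecar binary:
/// "<base>-<target-triple>", plus ".exe" for Windows targets.
pub fn sidecar_file_name(base: &str, target_triple: &str) -> String {
    let ext = if target_triple.contains("windows") { ".exe" } else { "" };
    format!("{base}-{target_triple}{ext}")
}
```

The triple for the current machine is shown on the `host:` line of `rustc -vV`, so a build script can read it from there rather than hard-coding it.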

Phase 2: Bridging the Frontend and Sidecars

WebView security policies are very strict about local network requests.

HTTP client: Instead of the standard fetch, I used the @tauri-apps/plugin-http plugin. This allows the React frontend to bypass CORS issues and talk directly to the local FastAPI backend.
User Experience: I added a "Health Check Polling" mechanism. The UI remains in an "Initializing" state until the backend sidecar reports status 200 OK, so no requests are lost during startup.

Debugging stories

The most interesting (and stressful) part of any migration is the troubleshooting. Here are several issues I encountered and how I fixed them.

For macOS users: Quarantine

When you download or bundle third-party binaries, macOS marks them with a "Quarantine" attribute. When Tauri tried to spawn Ollama, it failed silently without any visible error.
So I added a cleanup step to the package.json build script that uses xattr -d com.apple.quarantine to strip this attribute from all bundled binaries before execution.

PIPE Buffer Hang

Originally, I used subprocess.PIPE to capture Ollama's logs in Python. However, once the log output filled the OS pipe buffer, the entire Ollama process would freeze (hang) on its next write.
To fix this, I redirected the sidecar output to a dedicated log file at ~/ollama_sidecar.log. This not only prevented the buffer-related hangs but also gave me a persistent way to inspect server logs.
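The fix in the article lives in the Python backend. The same idea, sketched in Rust only to stay close to the Tauri side of the stack, looks like this: hand the child a file handle for stdout and stderr so that no pipe can ever fill up. The function names are made up for the example.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;
use std::process::{Child, Command, Stdio};

/// Open (or create) the log file in append mode so restarts keep old logs.
pub fn open_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Spawn `program` with stdout and stderr written to `log_path` instead of
/// a pipe. The child writes to the file directly, so it cannot block on a
/// full pipe buffer even if nobody ever reads the logs.
pub fn spawn_with_log(program: &str, args: &[&str], log_path: &Path) -> io::Result<Child> {
    let log = open_log(log_path)?;
    let log_for_stderr = log.try_clone()?; // Second handle to the same file
    Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::from(log))
        .stderr(Stdio::from(log_for_stderr))
        .spawn()
}
```

If the logs must be captured in-process instead, the alternative is to keep the pipe and drain it continuously from a dedicated thread.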

Zombie Processes and Lifecycle Management

I also struggled with Ollama instances staying alive after the application closed (zombie processes), because Ollama runs as a separate process.
So I used start_new_session=True when spawning the subprocess in Python, to detach the child from the parent session. Furthermore, I implemented a socket-based port check to verify that the port was truly bound before declaring the server ready.
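A socket-based readiness check of the kind mentioned above can be sketched with the standard library alone. The article's version is in Python; this Rust variant, with its function names, timeouts, and polling interval, is an illustration rather than the project's code.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// True if something accepts TCP connections on 127.0.0.1:`port` right now.
pub fn port_is_open(port: u16, timeout: Duration) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Poll until the port accepts connections or `deadline` has elapsed.
/// Returns false on timeout so the caller can surface a startup error.
pub fn wait_for_port(port: u16, deadline: Duration) -> bool {
    let start = Instant::now();
    while start.elapsed() < deadline {
        if port_is_open(port, Duration::from_millis(200)) {
            return true;
        }
        thread::sleep(Duration::from_millis(100));
    }
    false
}
```

For Ollama this would typically be called with its default port, 11434, before the UI leaves its "Initializing" state.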

So, was this worth it?

The results speak for themselves. The installation package size plummeted compared to the Electron version, and memory usage is significantly lower. The combination of a React UI and the speed of Tauri v2, backed by the raw power of Python and Ollama, makes for a truly premium developer tool experience.

'Chainguard' image for secure service

kination — Sat, 17 Jan 2026 05:28:28 +0000

If you work in DevOps or backend systems development, one of the critical sources of stress is 'security'. Even though it is always at the top of the priority list, its substance remains elusive.

Assume you are maintaining a k8s cluster or Docker images (these days, most DevOps engineers deal with at least one of these). Even if your code is secure, the base image (Debian, Ubuntu, or even Alpine) often carries inherent technical debt, and within that debt there are often security risks capable of compromising the service. This is where Chainguard Images come in.

What is Chainguard?

In short, Chainguard Images are "secure-by-default" container images. They are built on top of Wolfi, a Linux "undistro" designed specifically for containers.

Unlike standard Docker Hub images, Chainguard images have a few distinct characteristics:

Distroless: They contain the bare minimum required to run the app. No shell, no package managers (in the runtime), and no bloat.
Daily Rebuilds: Every image is rebuilt daily from upstream sources to patch vulnerabilities immediately.
SBOMs & Signing: They come with Software Bill of Materials and Sigstore signatures out of the box.

The Showdown: Official vs Chainguard

The most immediate benefit is reduction in noise. Let's look at a comparison I ran using trivy on a standard Python image and Chainguard.

Official Image (python:3.11): Over 300 vulnerabilities, with some of them categorized as "Critical" or "High"
Chainguard Image (cgr.dev/chainguard/python:latest): 0 CVEs

This isn't magic; it's just aggressive minimalism. By removing the OS components that your application doesn't actually use, you remove the attack surface.

Hands-on: Migration Guide

Migrating isn't always a simple swap, and it can be more painful with Chainguard. Because these images are distroless, you cannot use apt-get or apk in the final runtime image. You must use multi-stage builds.

Tool restriction

A typical (and slightly vulnerable) Dockerfile might look like this:

# Standard Python Image
FROM python:3.9-slim

WORKDIR /app

# Installing dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Running as root (default)
CMD ["python", "app.py"]

It is a pretty simple Dockerfile for running a Python application.

Here's the migrated version with a Chainguard image:

# 1: Make 'builder'
# Use '-dev' tag here, because we need a shell and build tools (gcc, headers, etc.)
FROM cgr.dev/chainguard/python:latest-dev AS builder

WORKDIR /app

# Create a virtual environment
RUN python -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2: Actual 'runner'
# This image has no 'shell' and 'package manager'. It is pure runtime.
FROM cgr.dev/chainguard/python:latest

WORKDIR /app

# Copy the virtual environment from the builder
COPY --from=builder /app/venv /app/venv
COPY . .

# Set path to use the virtualenv
ENV PATH="/app/venv/bin:$PATH"

# chainguard runs as a non-root user ('nonroot') by default.
# No need to create a user manually.
CMD ["python", "app.py"]

Gotchas and Troubleshooting

"I can't exec into the pod!": Since the runtime image has no shell (/bin/sh or bash), you cannot run kubectl exec -it <pod> -- sh. If you need to debug, you must use ephemeral debug containers or temporarily switch to the -dev image tag.

Permissions: Chainguard enforces non-root execution. If your app tries to write to / or install packages at runtime, it will fail. Ensure your app only writes to designated directories (like /tmp or a mounted volume) where the nonroot user has permissions.

Package restriction

Work without `sudo`

"Why can't I run or install sudo?"

...is perhaps the most common question when first switching to Chainguard (it was mine, too).

By security-first design, Chainguard images do not include the sudo command (and mostly do not allow installing it through the package manager). They are built to follow the principle of least privilege, running as a pre-configured non-root user (UID 65532) by default. This effectively eliminates entire classes of privilege escalation attacks.

The Problem: Legacy setup scripts or runtime commands that rely on sudo will fail with a command not found error.

The Solution: Frankly, there is no simple solution.

You must shift your mindset to the build stage. The first thing to do is refactor the application so that it runs without root access wherever possible. For example, if your app needs to bind to a privileged port (like 80), don't work around it with sudo. Instead, bind to a higher port (like 8080) and handle the mapping at the infrastructure level (e.g., Kubernetes Service or Load Balancer).
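One way to make the "no privileged ports" rule hard to break is to enforce it where the listen address is resolved. The sketch below is illustrative: `resolve_listen_addr` and the idea of feeding it a value such as a `PORT` environment variable are assumptions, not something Chainguard prescribes.

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Default unprivileged port used when nothing is configured.
pub const DEFAULT_PORT: u16 = 8080;

/// Resolve the address to listen on from an optional setting (for example
/// the value of a PORT environment variable). Falls back to 8080 and
/// rejects privileged ports (< 1024), which a non-root user cannot bind.
pub fn resolve_listen_addr(configured: Option<&str>) -> Result<SocketAddr, String> {
    let port = match configured {
        None => DEFAULT_PORT,
        Some(raw) => raw
            .trim()
            .parse::<u16>()
            .map_err(|e| format!("invalid port {raw:?}: {e}"))?,
    };
    if port < 1024 {
        return Err(format!(
            "port {port} is privileged; listen on a port >= 1024 and map it at the Service or load balancer"
        ));
    }
    Ok(SocketAddr::from((Ipv4Addr::UNSPECIFIED, port)))
}
```

The returned address can be passed straight to `std::net::TcpListener::bind`, while port 80 stays a concern of the Kubernetes Service in front of the pod.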

Still, some steps will need root. Any operation requiring root privileges, such as installing system dependencies or modifying file permissions, must be performed in a build stage of a multi-stage build using the USER root directive.

"Latest-Only" Package Philosophy

Chainguard's underlying package ecosystem, Wolfi, operates on a Rolling Release model. This introduces a specific constraint regarding package versioning.

Fixed Repo Constraints: Chainguard images are configured to pull exclusively from trusted Wolfi or Chainguard repositories. Adding third-party or unverified repos is strongly discouraged to maintain the "Zero CVE" guarantee.

The Latest-Only Rule: In the public Wolfi repository, generally only the latest stable version of a package is maintained.

If you try to pin an outdated version (e.g., apk add openssl=1.1.1), you will likely encounter an "Unsatisfiable constraints" error because that version has been removed from the index to prevent the use of vulnerable software.

The Philosophy: Chainguard forces you to stay current. This "latest-only" approach ensures that you aren't unknowingly baking known vulnerabilities into your images. If your system absolutely requires a legacy, EOL (End-of-Life) version, you may need to consider their Enterprise tier, which provides a private repository for older, patched versions.

The Trap of "Legacy Version" Maintenance

One of the biggest reasons teams stick to old Docker images (like node:14 or python:3.6) for production services is stability.

"It works, so please don't touch it."

However, in the standard Docker world, if you pin a version like python:3.6-slim, that base OS (likely an old Debian version) stops receiving security updates. You are sitting on a ticking time bomb of OS-level vulnerabilities.

Frankly, everybody knows they need to do something, but they cannot. Migrating production software is not just a matter of changing some code and packages. It needs long-term testing across millions of use cases. Modern software is built on top of a giant stack of systems and source code, and even top-level software engineers cannot anticipate every exceptional case. That's why even highly skilled engineers worry about every migration.

Chainguard changes this paradigm. Even if you use a specific language version, Chainguard rebuilds the underlying Wolfi OS layers every single day.

The Good: You get the stability of your language version with the security of a bleeding-edge OS.
The Bad: The image hash changes daily. If your deployment pipeline relies on the exact same SHA digest existing for months, this will break your flow. You need to get used to the idea of Rolling Tags.
The Cost: It is worth noting that for End-of-Life (EOL) language versions (like Python 3.7 or Java 8), Chainguard usually moves those images to their paid tier. The free tier focuses on currently supported versions.

"Container" with OCI runtime

kination — Sun, 17 Aug 2025 09:40:28 +0000

Rise of Docker project

Docker was introduced in 2013 by "dotCloud" (later renamed Docker Inc.) and has since become the common standard for containerization. It also brought container technology close to every engineer (not only those working on infrastructure), making it easy to set up a common development environment for engineers working together.

Internally, it is built on containers: a container is a lightweight, standalone, executable software package that encapsulates an application and all its dependencies, including code, runtime, system tools, libraries, and settings.

Container?

You are probably familiar with the "Docker image", which is a kind of blueprint for creating instances. A container is a running instance of an image, and one image can be started as many containers. When an image is run (for example with docker run), it becomes a container, which provides an isolated, consistent environment in which the application executes on top of a 'runtime'.

A container runtime is a low-level software component responsible for executing and managing containers on a host operating system. It acts as the interface between the container and the underlying system resources.

Initially, Docker didn't have its own low-level container runtime, so it relied on LXC (Linux Containers), an existing userspace toolset built on the Linux kernel's process isolation features. However, this created a dependency on a specific interface and had some limitations in portability and functionality.

To avoid these, Docker developed "libcontainer". This provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container and perform additional operations after the container is created.

What is runC?

runc is a CLI tool for spawning and running containers on Linux according to the OCI specification. In other words, it handles the low-level work of creating and managing containers based on a standard named the OCI Runtime Specification (more on this a bit later).

When you run a command like docker run, the Docker daemon (more precisely, a higher-level runtime such as containerd) ultimately calls runC to perform the actual container creation.

OCI Runtime?

The Open Container Initiative (OCI) Runtime Specification defines how to run a containerized application. It provides a standardized way to define the configuration and lifecycle of a container, ensuring that any OCI-compliant runtime can correctly execute a container bundle created by another tool.

The core of this specification is the filesystem bundle which contains:

rootfs: A directory containing the root filesystem of the container.
config.json: A JSON file, that defines all the configuration for container process.

The config.json is the configuration used to create the container. It specifies the essential details the OCI runtime needs to set up the isolated environment correctly.

Here's a sample:

{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": true,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "/bin/sh"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ]
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "my-container",
  "mounts": [],
  "linux": {
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "network"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      }
    ],
    "cgroupsPath": "/my-container-cgroup"
  }
}

You can find full specification here -> https://specs.opencontainers.org/runtime-spec/config/

Implementing Basic OCI-Runtime

Implementing a basic OCI-compliant runtime involves using core Linux kernel features to create an isolated environment based on the "config.json" file.

These are the key requirements:

Parse config.json: The runtime first reads and parses the file to understand the container's configuration.
Create namespaces: It must use system calls like "unshare" or "clone" to create new Linux namespaces (PID, mount, UTS, network, etc.) as specified in the configuration file. These namespaces isolate the container's processes from the host system.
Set up cgroups: The runtime needs to interact with the "cgroup" filesystem (/sys/fs/cgroup) to create and configure the control group for the new container. This makes it possible to apply resource limits for CPU, memory, I/O, etc.
Change root filesystem: Perform the pivot_root or chroot system call to change the root filesystem of the container process to the rootfs directory specified in the config. This ensures the container can access only its own filesystem.
Execute: Finally, the runtime calls the execve system call to replace its own process with the container's main command (e.g., /bin/sh) inside the new, isolated environment. This new process becomes PID 1 inside the container's namespace.

[package]
name = "my-oci-rt"
version = "0.0.0"
edition = "2021"

[dependencies]
nix = { version = "0.30.0", features = ["sched", "process", "user", "hostname", "fs"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

use nix::sys::wait::waitpid;
use nix::unistd::{chroot, setgid, sethostname, setuid, Gid, Uid};
use nix::sched::{clone, CloneFlags};
use serde::Deserialize;
use std::ffi::CString;
use std::fs;
use std::path::PathBuf;


#[derive(Deserialize, Debug)]
struct Spec {
    process: Process,
    root: Root,
    hostname: Option<String>,
}

#[derive(Deserialize, Debug)]
struct Process {
    args: Vec<String>,
    user: User,
}

#[derive(Deserialize, Debug)]
struct User {
    uid: u32,
    gid: u32,
}

#[derive(Deserialize, Debug)]
struct Root {
    path: PathBuf,
}

// size of stack (1MB)
const STACK_SIZE: usize = 1024 * 1024;

fn main() {
    let raw_config = fs::read_to_string("config.json").expect("Failed to read config.json");
    let spec: Spec = serde_json::from_str(&raw_config).expect("Failed to parse config.json");

    // Create child process
    let mut stack = [0; STACK_SIZE];

    let child_closure = || -> isize {
        println!("== Inside child process {} ==", nix::unistd::getpid());

        if let Some(hostname) = &spec.hostname {
            sethostname(hostname).expect("Failed setting hostname");
        }

        chroot(&spec.root.path).expect("chroot failed");
        std::env::set_current_dir("/").expect("Failed changing directory to /");

        // Setup gid/uid
        setgid(Gid::from_raw(spec.process.user.gid)).expect("Failed setting GID");
        setuid(Uid::from_raw(spec.process.user.uid)).expect("Failed setting UID");

        // parse/trigger command
        let command = &spec.process.args[0];
        let args: Vec<CString> = spec.process.args.iter()
            .map(|s| CString::new(s.as_str()).unwrap())
            .collect();

        // swap to new process
        nix::unistd::execvp(&CString::new(command.as_str()).unwrap(), &args).expect("execvp failed");

        0
    };

    // create new UTS, PID, and mount namespaces
    let clone_flags = CloneFlags::CLONE_NEWUTS | CloneFlags::CLONE_NEWPID | CloneFlags::CLONE_NEWNS;
    let child_pid = unsafe { clone(Box::new(child_closure), &mut stack, clone_flags, Some(nix::sys::signal::Signal::SIGCHLD as i32)) }
        .expect("clone failed");

    println!("== In Parent Process ==");
    println!("Spawned child with PID: {}", child_pid);

    // wait for the child process to exit
    waitpid(child_pid, None).expect("waitpid failed");
    println!("Child process exited.");
}

Things to prepare:

To run this project, prepare the following.

Simple config.json:

{
  "ociVersion": "1.0.0",
  "process": {
    "args": [
      "/bin/sh",
      "-c", 
      "echo 'Hello container!' && ls -l"
    ],
    "user": {
      "uid": 0,
      "gid": 0
    }
  },
  "root": {
    "path": "rootfs"
  },
  "hostname": "my-container"
}

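Root filesystem for the container (the rootfs directory):

Let's use a simple environment based on "busybox". Since config.json runs /bin/sh, the rootfs also needs a shell at bin/sh (with busybox, typically a link to the busybox binary).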

$ mkdir rootfs
$ cd rootfs
$ wget https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
$ chmod +x busybox

Result

Now you will see the message from the container, as follows:

$ ./target/debug/my-oci-rt
== In Parent Process ==
Spawned child with PID: 4500
== In Child Process (PID: 1) ==

Hello container!
total 1108
drwxr-xr-x    4 0        0              128 Aug 18 02:52 bin
-rwxr-xr-x    1 0        0          1131168 Jan 17  2022 busybox

Child process exited.

Note:

The nix library only exposes system calls that exist on the operating system you're compiling for. So, when compiling on macOS, the compiler will throw the following error:

...
  |
1 | use nix::sched::{clone, CloneFlags};
  |                  ^^^^^  ^^^^^^^^^^ no `CloneFlags` in `sched`
  |                  |
  |                  no `clone` in `sched`
  |
  = help: consider importing this module instead:
          std::clone
...

The nix::sched::clone function and CloneFlags are Linux-specific, so they are not included in the nix library on macOS.

If you're trying this on a Mac, build and run it in a separate Linux environment, for example with "Docker Desktop".

'Spark Connect' in Apache Spark 4.0

kination — Thu, 17 Jul 2025 03:01:18 +0000

Apache Spark 4.0 was officially released a few months ago, with lots of enhancements and new features.

While the release includes various improvements, the evolution of Spark Connect particularly stands out as a significant leap forward, making remote Spark sessions more seamless and powerful.

What is `Spark Connect`?

If you're unfamiliar with it, you can think of it as a "decoupled client-server architecture" for Apache Spark.

It has a client component and a server component. Through the client API, developers can interact with a Spark cluster from any application or environment, such as IDEs, notebooks, and applications written in various languages. It is similar to Apache Livy, but it uses a much thinner client to keep the application lightweight. Moreover, it uses the same API format and names as the classic Spark module, so it does not require additional API translation on the client.

On the client side, it starts by translating DataFrame operations into unresolved logical plans encoded with protocol buffers. These are sent to the server over gRPC, and the received protobuf messages are reconstructed into a logical plan. From there, the usual Spark execution process takes over: optimizing the logical plan, transforming it into a physical plan, and executing it.

After execution, the result is returned as "Apache Arrow" record batches streamed back over gRPC, so the client can read them sequentially.

This separation of client and server provides greater flexibility and easier dependency management.
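The flow described above can be illustrated with a toy model. This is not the actual protocol: json stands in for protocol buffers, a function call stands in for gRPC, and plain lists stand in for Arrow batches. Every class, function, and table name here is hypothetical. What it shows is the split itself: the client only records operations as a plan, and the server resolves the table name and executes.

```python
import json


class ClientDataFrame:
    """Client-side handle: records operations and never touches data."""

    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def read_table(name):
        # The table name is unresolved: the client does not know if it exists.
        return ClientDataFrame({"op": "read", "table": name})

    def filter_min(self, column, minimum):
        return ClientDataFrame(
            {"op": "filter", "column": column, "min": minimum, "child": self.plan})

    def select(self, *columns):
        return ClientDataFrame(
            {"op": "select", "columns": list(columns), "child": self.plan})

    def serialize(self):
        # json is a stand-in for the protobuf-encoded plan
        return json.dumps(self.plan).encode("utf-8")


# Server side: data lives here only.
TABLES = {"users": [{"name": "a", "age": 31}, {"name": "b", "age": 17}]}


def run(node):
    """Resolve and execute one plan node (recursively)."""
    if node["op"] == "read":
        return TABLES[node["table"]]
    rows = run(node["child"])
    if node["op"] == "filter":
        return [r for r in rows if r[node["column"]] >= node["min"]]
    if node["op"] == "select":
        return [{c: r[c] for c in node["columns"]} for r in rows]
    raise ValueError(f"unknown op: {node['op']}")


def server_execute(message):
    return run(json.loads(message.decode("utf-8")))


df = ClientDataFrame.read_table("users").filter_min("age", 18).select("name")
result = server_execute(df.serialize())
print(result)
```

Because the client holds only the plan, it needs none of the server's dependencies, which is the dependency-management benefit in miniature.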

What's in Spark 4.0 for this

Enhanced API Coverage and Stability

Previous Spark Connect users may have encountered missing or partially implemented APIs. The development team has made a great effort to address these gaps in Spark 4.0.

Through the work done in SPARK-47908 and SPARK-49248, Spark Connect now achieves much higher API parity with the classic Spark implementation. This means that more of your existing Spark code will work seamlessly in 'connect' mode, without requiring extensive rewrites.

This improved compatibility makes migrating existing applications to the connect architecture a much smoother and more predictable process.

Simple trigger switch for 'connect' mode

One of the most user-friendly changes in Spark 4.0 is the simplified process to enable Spark Connect. Previously, setting up a remote connection required more specific configuration. Now activating it has become simple: add a single configuration property.

With the work in SPARK-50605, you can now switch to 'connect' mode just by setting "spark.api.mode".

from pyspark.sql import SparkSession

SparkSession.builder.config("spark.api.mode","connect").master("...").getOrCreate()

$ spark-submit --master "..." --conf spark.api.mode=connect

The default value is 'classic', so Spark behaves as before if you don't set anything, which ensures backward compatibility. It is a small change, but it significantly lowers the barrier to entry for developers looking to leverage the benefits of a remote, interactive Spark environment.

Machine Learning modules

Another significant leap is in Machine Learning (ML) capabilities. As outlined in SPARK-50812, you can now execute not only SQL/DataFrame operations, but also ML workloads through Spark Connect.

This is a great improvement for data scientists who want to build, train, and manage models on a powerful cluster directly from their local development environment or notebook.

Additional improvements targeting 4.1 are still in progress in SPARK-51236, so you can look forward to those too.

Retrieval-Augmented Generation (RAG) for AI starter

kination — Thu, 10 Jul 2025 12:30:11 +0000

AI everywhere, but it's not private

With the advancement of AI technologies, it is now possible to access a wide range of information through conversational interfaces, just by asking "what is xxx?". Because this is so useful, many companies encourage employees to use AI chatbots to boost productivity, and some of them even cover subscription costs.

However, these AI models are trained with publicly available data, and naturally, they don't know about any proprietary or internal company information. For instance, if an employee asks, "What does it take to get a good performance review at work?", a general-purpose AI may respond with vague advice like "Deliver good results", rather than offering company-specific guidelines.

Introduction to RAG (Retrieval-Augmented Generation)

To build an AI system that can answer such internal questions accurately, organizations typically need to integrate their own data into the AI's capabilities.

One common approach is to train a generative AI model on internal documents through a method known as fine-tuning. However, this is often resource-intensive, requiring significant time and cost, and it's inefficient when internal information changes frequently. For this reason, the Retrieval-Augmented Generation (a.k.a. RAG) approach is generally preferred.

RAG leaves the base generative model untouched, and retrieves relevant documents at query time. When a user submits a question, the system searches internal knowledge sources, retrieves the most relevant content, and provides it to the chatbot as context. This allows the AI to deliver accurate and up-to-date responses, while also enabling users to trace the source of the information.

The following code is a simple example of this logic, using the Gemini API.

import google.generativeai as genai

genai.configure(api_key="<your-api-key>")
model = genai.GenerativeModel('<gemini-model-name>')


def gemini_rag_query(question, context_docs):
    context = "\n".join(context_docs)

    prompt = f"""
    Introduction:
    {context}

    Question: {question}

    Please describe by referring "Introduction"
    """

    response = model.generate_content(prompt)
    return response.text


cont_doc = """
My name is 'kination'.
I'm software engineer. My skill is

Programming:
Skilled: Java, Python, TypeScript(JavaScript), Scala
Others: Rust

Experienced in:
Data Engineering
Cloud Infrastructure
Application development: Web, Android

Language:
Korean: Native
English: Business Level
Japanese: Conversational (JLPT - N2 certificated)
"""

# context_docs is a list of documents, so wrap the single string in a list
response = gemini_rag_query("Could you describe about user kination?", [cont_doc])
print(response)

and here's the result:

...
Based on the provided introduction, 'kination' is a software engineer.

They possess strong programming skills, being skilled in Java, Python, TypeScript (JavaScript), and Scala, with additional familiarity with Rust. Their experience extends to Data Engineering, Cloud Infrastructure, and application development for both web and Android platforms.

Regarding languages, 'kination' is a native Korean speaker, proficient in English at a business level, and conversational in Japanese, holding a JLPT - N2 certificate.

If you ask about another username (such as 'Jimmy Rock'), it will respond:

There is no information about 'Jimmy Rock' in this document...

In this sample, the content is supplied as raw text in a variable. If you need to load documents from files such as PDF or Word, you need additional logic to split the contents into chunks.

from langchain_community.document_loaders import PyPDFLoader, UnstructuredWordDocumentLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


pdf_loader = PyPDFLoader("kination.pdf")
# If content is word
# word_loader = UnstructuredWordDocumentLoader("kination.docx")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)
chunks = text_splitter.split_documents(pdf_loader.load())

...

This typically involves splitting the content into smaller, manageable chunks, which helps improve the relevance of search results during retrieval.
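To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification: RecursiveCharacterTextSplitter prefers to split on separators such as paragraphs and sentences, while this hypothetical helper just slides a window over characters.

```python
def split_with_overlap(text, chunk_size=250, chunk_overlap=50):
    """Slide a window of chunk_size characters, stepping by size minus overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.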

Once the content is chunked, each chunk needs to be transformed into a vector: a numerical representation that captures the semantic meaning of the text (usually produced by an embedding model). These vectors are then stored in a vector DB, which allows efficient similarity search against the user's query.
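The retrieval step can be sketched without any external service. In this toy version, word counts stand in for a real embedding model and an in-memory list stands in for the vector DB, so it only matches on shared words, not on meaning. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    query_vec = embed(question)
    index = [(chunk, embed(chunk)) for chunk in chunks]  # stand-in for a vector DB
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Skilled: Java, Python, TypeScript, Scala",
    "Korean: Native. English: Business Level",
    "Experienced in: Data Engineering, Cloud Infrastructure",
]
best = top_k("Which programming languages like Python or Scala?", chunks)
print(best)  # ['Skilled: Java, Python, TypeScript, Scala']
```

The chunks returned by top_k are what would be passed as context_docs to a function like gemini_rag_query.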

Setting up memory for Flink 2 - What to think about...

kination — Mon, 23 Jun 2025 08:22:50 +0000

Previously, I wrote about memory configuration in Flink.

https://dev.to/kination/setting-up-memory-for-flink-configuration-4jm1

...and these are the considerations you need for managing memory.

General

By understanding and configuring these memory types appropriately, you can optimize Flink's performance and ensure efficient resource utilization in your applications.

Balancing Memory Types

It's important to balance the allocation of these different memory types based on the specific needs of your application. Over-allocating one type of memory can lead to under-utilization of others.
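The trade-off can be made concrete with a little arithmetic. The sketch below is a simplified model of how a TaskManager's total process memory is divided: it ignores the min/max caps that Flink applies to some components and assumes task off-heap memory is zero. The numbers passed in are illustrative, not recommendations, so check the configuration reference for your Flink version. The function name is hypothetical.

```python
def taskmanager_breakdown(process_mb, managed_fraction, network_fraction,
                          overhead_fraction, metaspace_mb, framework_mb):
    """Split total process memory into components (simplified, no min/max caps).

    process_mb        ~ taskmanager.memory.process.size
    managed_fraction  ~ taskmanager.memory.managed.fraction
    network_fraction  ~ taskmanager.memory.network.fraction
    overhead_fraction ~ taskmanager.memory.jvm-overhead.fraction
    metaspace_mb      ~ taskmanager.memory.jvm-metaspace.size
    framework_mb      ~ framework heap + framework off-heap
    """
    overhead = process_mb * overhead_fraction          # fraction of process memory
    flink_total = process_mb - overhead - metaspace_mb
    managed = flink_total * managed_fraction           # fraction of total Flink memory
    network = flink_total * network_fraction
    task_heap = flink_total - managed - network - framework_mb  # what is left
    return {"jvm_overhead": overhead, "managed": managed,
            "network": network, "task_heap": task_heap}


base = taskmanager_breakdown(4096, 0.4, 0.1, 0.1, 256, 256)
more_managed = taskmanager_breakdown(4096, 0.6, 0.1, 0.1, 256, 256)
print(round(base["task_heap"], 2), round(more_managed["task_heap"], 2))
```

With the same 4096 MB budget, raising the managed fraction from 0.4 to 0.6 cuts the memory left for task heap roughly in half, which is the over-allocation effect described above.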

Monitoring and Tuning

Regularly monitor memory usage and performance metrics to fine-tune memory settings. Tools like Flink's Web UI and external monitoring solutions can provide insights into memory usage patterns.

JobManager

Increasing the memory allocated to JobManager in Apache Flink can significantly enhance the stability and performance of your Flink cluster, particularly in large-scale deployments with numerous jobs. Here's how:

Improved Job Scheduling and Management

With more memory, JobManager can handle a larger number of concurrent jobs. This is crucial in environments where multiple jobs are submitted and executed simultaneously.
Efficient Scheduling: Increased memory allows JobManager to maintain more detailed metadata about jobs, which can improve the efficiency of job scheduling and resource allocation.

Enhanced Metadata Management

The JobManager stores job graphs and execution plans in memory. With more memory, it can efficiently manage and store these data structures for a larger number of jobs.
Also, JobManager coordinates checkpoints across all tasks. More memory allows it to handle complex checkpointing scenarios, especially in jobs with large state sizes or many parallel tasks.

Better Fault Tolerance

In the event of a failure, JobManager needs to recover job states and restart jobs. More memory enables it to store comprehensive recovery information, reducing downtime and improving fault tolerance.
For jobs with large states, JobManager can manage state backends more effectively, ensuring that state snapshots and recovery processes are handled smoothly.

Improved Performance in High-Throughput Scenarios

JobManager processes various events, such as task status updates and resource allocation requests. More memory allows it to handle a higher volume of events without becoming a bottleneck. By having sufficient memory to manage internal queues and buffers, JobManager can reduce latency in job execution and coordination.

Considerations

While increasing JobManager memory can provide these benefits, it's important to balance this with the overall resource availability in your cluster. Over-allocating memory to the JobManager might lead to resource constraints elsewhere.

Regularly monitor JobManager's memory usage and performance metrics to ensure that the allocated memory is being utilized effectively. Adjust configurations based on observed performance and workload characteristics.

By allocating sufficient memory to JobManager, you can enhance the robustness and efficiency of your Flink deployment, especially in scenarios involving complex, high-throughput, or large-scale job executions.

TaskManager Memory

Allocating more memory to TaskManager in Apache Flink can significantly enhance its ability to handle larger state backends, improve data processing efficiency, and reduce I/O operations. Here's how:

Larger State Backends

Flink's stateful stream processing allows tasks to maintain state across events. This state is crucial for operations like aggregations, joins, and windowing.

With more memory, a larger portion of the state can be kept in memory, which is faster to access than disk-based storage. This is particularly beneficial for applications with large state requirements.

Also, when memory is limited, Flink may need to spill state to disk, which can slow down processing. More memory reduces the need for spilling, allowing for faster state access and updates.

More Efficient Data Processing

Additional memory allows for larger buffers and caches, which can improve the throughput of data processing tasks. This is especially important for operations that involve sorting, grouping, or joining large datasets.

More memory can support higher levels of parallelism by allowing more task slots per TaskManager. This enables the processing of more data in parallel, improving overall job throughput.

Sufficient memory helps in managing data flow more effectively, reducing the likelihood of back-pressure, which can occur when downstream tasks cannot keep up with upstream data production.

Reduced I/O Operations

By keeping more data in memory, TaskManagers can minimize the need for disk I/O operations, which are typically slower than memory operations. This is crucial for maintaining high throughput and low latency.

Also, more memory allows for efficient checkpointing by reducing the frequency and size of data written to persistent storage. This can speed up the checkpointing process and reduce the impact on processing performance.

In addition, larger memory allocations can improve network I/O by allowing more data to be buffered and sent in larger batches, reducing the overhead associated with frequent small network transmissions.

Considerations

While increasing memory can provide these benefits, it's important to balance memory allocation with other resources like CPU and network bandwidth to avoid bottlenecks.

Regularly monitor memory usage and performance metrics to ensure that the additional memory is being utilized effectively. Adjust configurations based on workload characteristics and observed performance.

By allocating more memory to TaskManagers, you can enhance the performance and efficiency of Flink applications, particularly those with large state requirements or high data throughput. This leads to faster data processing, reduced latency, and improved scalability.

How to treat secure data on lakehouse

kination — Tue, 08 Apr 2025 11:25:46 +0000

In the modern data stack, the lakehouse has emerged as a hybrid solution that combines the scalability of a data lake with the transactional power of a data warehouse. But with this convergence comes a heightened need to manage secure data responsibly - whether it’s personally identifiable information (PII), health records, or financial transactions.

In this blog, we’ll walk through how to treat secure data in a Lakehouse architecture, from ingestion and storage to governance and auditing.

What is Lakehouse?

Lakehouse is a modern data architecture that combines the best of both data lake and data warehouse. It aims to deliver the scalability and flexibility of a data lake with the reliability, performance, and governance of a warehouse. By unifying raw and structured data in one platform, lakehouses support analytics, BI, and machine learning without the need for complex ETL pipelines.

The goal is to get the benefits of both systems while simplifying data workflows. However, while powerful, it also faces challenges in query performance, governance maturity, and integration with legacy tools compared with each of them.

Keep it secure

Convert existing data

In many data systems, certain columns contain sensitive information (user name, email, ID...) that must be handled differently to comply with privacy regulations and national security policies. To protect this data, various techniques can be applied depending on the use case.

If only specific teams (e.g., a research department) need access to the original values, the data can be encrypted using keys that are only shared with those teams. When the original value is not needed but uniqueness is required, for example for joining datasets, hashing is a suitable approach. In cases where partial visibility is acceptable (e.g., showing only the last four digits of a phone number), string masking can be used to obscure parts of the value while retaining some context.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, udf
from pyspark.sql.types import StructField, StructType, StringType
from cryptography.fernet import Fernet  # assume using fernet for encryption

# ---------- Setup ----------
spark = SparkSession.builder.appName("SecureSample").getOrCreate()

# ---------- Sample Schema ----------
schema = StructType([
    StructField("name", StringType()),
    StructField("email", StringType()),
])

SECRET_KEY = b"..."
fernet = Fernet(SECRET_KEY)

def encrypt_email(email: str) -> str:
    if email is None:
        return None

    encrypted = fernet.encrypt(email.encode())
    return encrypted.decode()

encrypt_udf = udf(encrypt_email, StringType())

# ---------- Read from Kafka ----------
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker-server-host:9092") \
    .option("subscribe", "kafka-topic-name") \
    .load()

json_df = kafka_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

# encrypt email column
encrypted_df = json_df \
  .withColumn("email_encrypted", encrypt_udf(col("email"))) \
  .drop("email")

# ... do something with encrypted data frame ...
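The code above covers encryption. The other two techniques mentioned, hashing and masking, can be sketched in plain Python. This is a minimal illustration under assumptions: the function names and the secret are hypothetical, and a keyed hash (HMAC-SHA256) is used here so that digests cannot be reproduced without the key. In a Spark job these functions would be wrapped as UDFs in the same way as encrypt_email.

```python
import hashlib
import hmac


def hash_for_join(value, secret):
    """Keyed SHA-256: equal inputs give equal digests, so joins still work."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_keep_last(value, visible=4, mask_char="*"):
    """Hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


SECRET = b"demo-only-secret"  # hypothetical; load from a secret manager in practice

digest_a = hash_for_join("user@example.com", SECRET)
digest_b = hash_for_join("user@example.com", SECRET)
masked = mask_keep_last("01012345678")

print(digest_a == digest_b)  # True: same value, same digest
print(masked)                # *******5678
```

Unlike encryption, neither output can be turned back into the original value, so these fit columns that never need to be read in clear text downstream.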

Access limitation

Fine-grained access control ensures that only authorized users can access specific data fields, rows, or columns—crucial for protecting sensitive information in a lakehouse.

AWS offers this through Lake Formation, which allows you to define column/row level permissions on tables stored in S3. You can grant access to specific users or groups using IAM roles, and manage access using the central Lake Formation console. Integration with AWS Glue Data Catalog ensures governance across multiple tools.

GCP also provides column-level security in BigQuery, where you can restrict access to specific columns using Data Catalog policy tags and IAM policies. For row-level security, authorized views or row access policies can be applied to show only the data a user is allowed to see.

Both platforms support auditing, logging, and attribute-based access control (ABAC) to track and enforce data governance at scale. These features are key to meeting compliance requirements like GDPR or HIPAA while keeping analytics flexible and secure.

Time of YAML/JSON for data engineer

kination — Sun, 06 Apr 2025 09:49:16 +0000

If you're tracking trends in data engineering, you may have noticed a big shift in the tools we use.

Until now, Java (+Scala) and Python have been the go-to languages for data engineers, and that is still true. However, there's a growing trend towards using YAML/JSON configuration to control data pipelines. Will it be a big change for us?

"Configuration" on top

Declarative Pipelines

One of the primary reasons data engineers are leaning towards YAML/JSON is that major platform (or cloud) service providers are offering declarative pipelines. These platforms allow (or encourage) engineers to define data workflows using only config files, instead of writing code.

This approach simplifies the process of setting up and managing complex data pipelines, making it more accessible and leaving fewer exceptional cases to handle.
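To show what "declarative" means here, the sketch below runs a tiny pipeline that is described entirely by a JSON config. It is a generic illustration, not the format of any particular platform: the step types, field names, and the run_pipeline function are all hypothetical.

```python
import json

# The pipeline is data: what to do, not how to do it.
CONFIG = """
{
  "steps": [
    {"type": "filter", "field": "amount", "min": 10},
    {"type": "rename", "from": "amount", "to": "amount_usd"}
  ]
}
"""

# The "engine" side: each step type maps to one small function.
STEP_HANDLERS = {
    "filter": lambda rows, s: [r for r in rows if r[s["field"]] >= s["min"]],
    "rename": lambda rows, s: [
        {(s["to"] if key == s["from"] else key): value for key, value in r.items()}
        for r in rows
    ],
}


def run_pipeline(config_text, rows):
    """Apply the configured steps to the rows, in order."""
    for step in json.loads(config_text)["steps"]:
        handler = STEP_HANDLERS.get(step["type"])
        if handler is None:
            raise ValueError(f"unknown step type: {step['type']}")
        rows = handler(rows, step)
    return rows


source = [{"user": "a", "amount": 5}, {"user": "b", "amount": 50}]
output = run_pipeline(CONFIG, source)
print(output)  # [{'user': 'b', 'amount_usd': 50}]
```

Changing the pipeline means editing the config, not the code. The flip side is that anything the handlers do not support cannot be expressed at all.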

Cloud services and simplified pipelines development

As adoption of cloud services such as AWS, Azure, and GCP increases, companies are building their data pipelines on these platforms. These providers offer various functions and services that simplify data pipeline creation and management.

While these services come at an additional cost, they significantly reduce the complexity involved in pipeline development. This allows engineers to focus on designing the main tasks at a high level, rather than struggling with low-level coding.

Started for comfort, now a trend

What started as a move towards greater comfort and ease of use has now become a trend due to the flexibility these tools offer. YAML/JSON allow for quick adjustments and iterations, making them ideal for the fast-paced environment of data engineering. This flexibility is particularly valuable in a field where requirements can change rapidly, and the ability to adapt is crucial.

Is this shift beneficial for Engineers?

The shift towards using config files brings both opportunities and challenges for data engineers. It is true that these tools can increase productivity by simplifying complex tasks and reducing the need for extensive coding. But on the other hand, they may limit the depth of customization and optimization that can be achieved with traditional programming languages like Java and Python.

Conclusion

The trend towards using configuration files in data engineering reflects a broader shift towards simplicity and flexibility in the field. While this shift offers many benefits, it's essential for data engineers to maintain a balance, ensuring they have the skills and knowledge to leverage both traditional programming languages and modern configuration tools effectively. In other words, configuration simplifies development, but it also requires engineers to have deeper knowledge of how the configuration affects the pipeline under the hood.

build-my-own-datalake: Starting from PoC

kination — Thu, 20 Mar 2025 07:02:25 +0000

build-your-own-x

build-your-own-x is a project that started in the GitHub community. As the name suggests, it aims to share code that builds well-known projects or applications (database, server framework, OS, and more) from the ground up.

As you know, there is a common programming tip, "Don't reinvent the wheel", which says that building something new from scratch is often a foolish endeavor. However, I believe this advice is only partially correct.

In terms of time efficiency, developing something from the ground up can indeed be time-consuming. But it's worth noting that long-established systems often carry significant technical debt and can be challenging to modify, especially when they're widely used in production environments.

So, if you have a clear purpose and compelling reasons for creating a new system from scratch, it's not necessarily a waste of time. In fact, it can be a valuable and justified approach in certain situations.

Motivation

Of course, what I'm starting here doesn't have such a grand purpose. I've worked as a data engineer for several years, and have dealt with file formats specialized for big data, such as Parquet/ORC.

datalake format

Recently, one trend in data engineering is the datalake format, which builds a table from a group of files and metadata laid out in a particular architecture. It gives flexibility in keeping data, with ACID transactions, versioning, and schema enforcement.

As you may know, there are already widely used datalake formats (iceberg, hudi, delta, ...). These formats are being rapidly developed with strong community support, and I'm also using them as part of my job.

However, aside from using them effectively, I have always been curious about how they work internally. While they are open source and their code can be reviewed, I believe that building one from the ground up is also an effective way to understand them.

That’s why I decided to start this project. Additionally, since most existing formats are based on Java/Scala, I wanted to explore whether using Rust could be an effective alternative.

How to start

So for the project, the first parts to build are:

metadata
core file reading/writing part

It should also offer an interface to communicate with data processing frameworks such as Spark/Flink, so a JNI layer is needed as well.

Result

Here's the current full result, named vine.

https://github.com/kination/vine

I'll describe only the core part in this Part 1 post, and cover the rest in later posts.

Core

#[no_mangle]
#[allow(non_snake_case)]
#[allow(unused_variables)]
pub extern "C" fn Java_io_kination_vine_VineModule_readDataFromVine(
    mut env: JNIEnv,
    class: JClass, 
    dir_path: JString) -> jobject {
    let path: String = env.get_string(&dir_path).expect("Cannot get data from dir_path").into();
    let rows = read_data(&path);
    let mut result = String::new();

    for row in rows {
        result.push_str(&row);
        result.push('\n')
    }

    let output = CString::new(result).expect("Cannot generate CString from result");    
    env.new_string(output.to_str().unwrap()).expect("Cannot create java string").into_raw()
}

#[no_mangle]
#[allow(non_snake_case)]
#[allow(unused_variables)]
pub extern "C" fn Java_io_kination_vine_VineModule_writeDataToVine(
    mut env: JNIEnv,
    class: JClass,
    path: JString,
    data: JString) {

    let path_str: String = env.get_string(&path).expect("Fail getting path").into();
    let data_str: String = env.get_string(&data).expect("Fail getting data").into();
    let rows: Vec<&str> = data_str.lines().collect();
    write_data(&path_str, &rows).expect("Failed to write data");
}

The Core module acts as the bridge between the JVM and the Rust engine. Through the Java Native Interface (JNI), it exposes readDataFromVine and writeDataToVine for data exchange.

Currently, data is passed as raw strings across the boundary. While this approach allowed for rapid prototyping, it adds overhead from string allocation and C-string conversion. There are better options, but this was the quickest way to get started.

Metadata

The metadata file is JSON-based and contains information about fields and the table name. Currently, it's stored as a single file, but additional files will be needed to support versioning and schema evolution.

{
  "table_name": "vine-test",
  "fields": [
    {
      "id": 1,
      "name": "id",
      "data_type": "integer",
      "is_required": true
    },
    {
      "id": 2,
      "name": "name",
      "data_type": "string",
      "is_required": false
    }
  ]
}

Writer

Writing goes as follows:

Read the raw data, which is a list of strings
Read the metadata, and check the field types
Match the field types with the data
Write to file

When parsing metadata, it only supports 4 types for now; more should be added.

...
    let metadata: Metadata = serde_json::from_str(&meta_str).expect("Failed to deserialize metadata");
    let meta_fields = metadata.fields.clone();

    let mut schema_str = String::from("message schema {\n");
    for field in meta_fields {
        let field_type = match field.data_type.as_str() {
            "integer" => "REQUIRED INT32",
            "string" => "REQUIRED BINARY",
            "boolean" => "REQUIRED BOOLEAN",
            "double" => "REQUIRED DOUBLE",
            _ => continue,
        };

        match field_type {
            "REQUIRED BINARY" => schema_str.push_str(&format!("    {} {} (UTF8);\n", field_type, field.name)),
            _ => schema_str.push_str(&format!("    {} {};\n", field_type, field.name))
        }
    }
...

After defining the types, it writes the data in Parquet format (of course, this can be changed).
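The schema-building step in the excerpt above can be restated compactly. The sketch below is a Python restatement, not the project's code, with one deliberate variation: it reads the is_required flag from the metadata and emits OPTIONAL for fields that are not required, whereas the Rust excerpt marks every field REQUIRED. The function name and the closing brace handling are assumptions.

```python
import json

# Metadata type -> Parquet physical type (the four types currently supported)
PARQUET_TYPES = {"integer": "INT32", "string": "BINARY",
                 "boolean": "BOOLEAN", "double": "DOUBLE"}


def build_schema(meta_json):
    """Build a Parquet message-type string from vine-style metadata."""
    metadata = json.loads(meta_json)
    lines = ["message schema {"]
    for field in metadata["fields"]:
        physical = PARQUET_TYPES.get(field["data_type"])
        if physical is None:
            continue  # unsupported type: skipped, as in the Rust match
        # Variation from the excerpt: honor is_required instead of always REQUIRED
        repetition = "REQUIRED" if field.get("is_required", True) else "OPTIONAL"
        suffix = " (UTF8)" if physical == "BINARY" else ""
        lines.append(f"    {repetition} {physical} {field['name']}{suffix};")
    lines.append("}")
    return "\n".join(lines)


META = """
{
  "table_name": "vine-test",
  "fields": [
    {"id": 1, "name": "id", "data_type": "integer", "is_required": true},
    {"id": 2, "name": "name", "data_type": "string", "is_required": false}
  ]
}
"""

schema = build_schema(META)
print(schema)
```

Supporting a new type then means adding one entry to the lookup table rather than another match arm.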

One known issue is that the current logic follows these steps:

Define the type of the raw data.
Store the raw data in a temporary variable of that type.
Write the data through the temporary variable.

However, I believe there's a more efficient approach to streamline this process.

Reader

    let meta_str = read_to_string("vine-test/vine_meta.json").expect("Failed to read vine_meta.json");
    let meta: Value = serde_json::from_str(&meta_str)
                    .expect("Failed to parse metadata JSON");

    for (_, path) in directories {
        let sub_entries = fs::read_dir(path).expect("Cannot read path");
        for se in sub_entries {
            let file_path = se.expect("Cannot get sub entry").path();
            if file_path.extension().map_or(false, |ext| ext == "parquet") {    
                let fields = meta["fields"].as_array()
                    .expect("fields should be an array");

                let file = File::open(file_path).expect("Cannot open file from file_path");
                let reader = SerializedFileReader::new(file).expect("cannot serialize file");
                let iter = reader.get_row_iter(None).expect("Cannot get row iterator");

                for row_result in iter {
                    if let Ok(row) = row_result {
                        let mut values = Vec::new();

                        for field in fields {
                            // Fields are 1-indexed in metadata, but 0-indexed in parquet
                            let col_index = (field["id"].as_i64().unwrap_or(0) - 1) as usize;
                            let data_type = field["data_type"].as_str().expect("data_type should be string");

                            let value = match data_type {
                                "integer" => row.get_int(col_index).unwrap_or_default().to_string(),
                                "string" => {                                    row.get_string(col_index).unwrap().clone()
                                },
                                ...
                                _ => String::from(""),
                            };
                            values.push(value);
                        }

                        row_list.push(values.join(","));
                    }
                }
            }
        }
    }

This module is responsible for reconstructing structured data from fragmented files. It doesn't read a single file; it traverses the directory structure associated with a table.

It scans for all .parquet files within the table's path, allowing for distributed data storage.
By aligning the 0-indexed columns of Parquet with our 1-indexed metadata IDs, the reader ensures that data is extracted into the correct fields.
Using SerializedFileReader, the reader iterates through rows and joins values into a comma-separated format for the consuming application.
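The three points above can be sketched without the Parquet dependency. This is only an illustration under stated assumptions: `FieldMeta` and `Cell` are hypothetical stand-ins for the `serde_json::Value` metadata entries and the Parquet row values used by the reader, and `is_parquet`, `column_index`, and `row_to_csv` are helper names introduced here, not functions from the project.

```rust
use std::path::Path;

/// Hypothetical stand-in for one entry of `fields` in vine_meta.json.
struct FieldMeta {
    id: i64,           // 1-indexed, as stored in the metadata
    data_type: String, // e.g. "integer" or "string"
}

/// Hypothetical stand-in for a value decoded from a Parquet row.
enum Cell {
    Int(i32),
    Str(String),
}

/// True when the path has a `.parquet` extension.
fn is_parquet(path: &Path) -> bool {
    path.extension().map_or(false, |ext| ext == "parquet")
}

/// Converts a 1-indexed metadata id to a 0-indexed column position.
/// Ids below 1 have no column, so they yield None instead of wrapping.
fn column_index(id: i64) -> Option<usize> {
    if id >= 1 {
        Some((id - 1) as usize)
    } else {
        None
    }
}

/// Picks each field's column out of the row and joins the values with commas.
/// Unknown types, missing columns, and type mismatches become empty strings.
fn row_to_csv(fields: &[FieldMeta], row: &[Cell]) -> String {
    let values: Vec<String> = fields
        .iter()
        .map(|field| {
            let cell = column_index(field.id).and_then(|i| row.get(i));
            match (field.data_type.as_str(), cell) {
                ("integer", Some(Cell::Int(v))) => v.to_string(),
                ("string", Some(Cell::Str(s))) => s.clone(),
                _ => String::new(),
            }
        })
        .collect();
    values.join(",")
}
```

Note that a plain `join(",")` does not escape commas inside string values, so a consumer that parses the output as CSV would need quoting or a different separator.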

After PoC

Building a datalake format from the ground up has been an exercise in understanding the delicate relationship between high-level data abstractions and low-level storage efficiency. In this part, I've established an early-stage data format that can be used from both Java and Rust, manages its own schema, and handles basic I/O operations using Parquet.

But this is only the starting point. My goal is not to replace it with an existing datalake format (like Iceberg or Hudi), but to keep improving it until it reaches production level.

I'll keep sharing the next steps as the project advances.
