<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CharmPic</title>
    <description>The latest articles on Forem by CharmPic (@charmpic).</description>
    <link>https://forem.com/charmpic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3301778%2Fb25090e0-32de-44a7-a2b3-8691c3a7f56a.png</url>
      <title>Forem: CharmPic</title>
      <link>https://forem.com/charmpic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/charmpic"/>
    <language>en</language>
    <item>
      <title>Language Barriers: A Struggle for Japanese Developers on Dev.to</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:58:52 +0000</pubDate>
      <link>https://forem.com/charmpic/language-barriers-a-struggle-for-japanese-developers-on-devto-kjc</link>
      <guid>https://forem.com/charmpic/language-barriers-a-struggle-for-japanese-developers-on-devto-kjc</guid>
      <description>&lt;p&gt;As a Japanese developer, I love browsing Dev.to to keep up with the latest tech trends. However, I often face a significant "wall" that hinders my learning experience: the language barrier.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The "I Can't Read English Fast Enough" Problem&lt;br&gt;
Let’s be honest—reading long technical articles in English is exhausting when it's not your native language. Even if a headline looks incredibly interesting, the psychological hurdle of clicking on a wall of English text is surprisingly high.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Limitations of Browser Translation&lt;br&gt;
You might say, "Just use Google Translate or built-in browser features!" But it’s not that simple:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Friction:&lt;/strong&gt; Having to manually trigger translation for every single page is a tedious extra step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; Standard browser translations often struggle with technical context. They sometimes mangle code snippets or turn specific jargon into nonsensical Japanese, forcing me to switch back to the original text anyway.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;A Dream Feature: AI-Powered Native Translation&lt;br&gt;
I often find myself wishing Dev.to would implement an integrated AI translation feature.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the power of modern LLMs, we could have context-aware, high-quality translations at the click of a button. Imagine a "Read in Japanese" toggle right next to the article!&lt;/p&gt;

&lt;p&gt;I understand that API costs are a major concern, making this a difficult feature to implement for free. But it’s painful to think about how many amazing insights I’m missing out on just because of the language gap. T^T&lt;/p&gt;

&lt;p&gt;I’d love to hear from you:&lt;br&gt;
How do non-native English speakers handle this? Do you use any specific tools or extensions that make your Dev.to experience smoother?&lt;/p&gt;

</description>
      <category>community</category>
      <category>devjournal</category>
      <category>discuss</category>
      <category>learning</category>
    </item>
    <item>
      <title>Re-evaluating the ROI of GLM-5.1 Pro After a Massive Price Hike to $680</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:29:49 +0000</pubDate>
      <link>https://forem.com/charmpic/re-evaluating-the-roi-of-glm-51-pro-after-a-massive-price-hike-to-680-i2d</link>
      <guid>https://forem.com/charmpic/re-evaluating-the-roi-of-glm-51-pro-after-a-massive-price-hike-to-680-i2d</guid>
      <description>&lt;p&gt;Headline: GLM-5.1 Pro Price Hike: 3x Increase to $680/year — Time to Look for Alternatives?&lt;/p&gt;

&lt;p&gt;I recently received some shocking news regarding the GLM-5.1 Pro plan.&lt;br&gt;
The annual subscription, which used to be a reasonable $180, has suddenly spiked to over $680. That is nearly a fourfold increase.&lt;/p&gt;

&lt;p&gt;To be fair, the GLM-5.1 Pro plan offered incredible value. Its performance and limits were comparable to the Claude Code $200/month (Max) tier, making it a "hidden gem" for developers. Even at $680/year, one could argue it still offers decent value considering the high-end capabilities.&lt;/p&gt;

&lt;p&gt;However, a nearly fourfold price jump changes the equation. At this price point, we can no longer ignore other major AI players in the market. It’s time to start comparing the cost-to-performance ratio against other leading LLMs again.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>news</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Instant Glory: The App That Makes Every Coder a DEV Challenge Winner</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:20:15 +0000</pubDate>
      <link>https://forem.com/charmpic/instant-glory-the-app-that-makes-every-coder-a-dev-challenge-winner-1fmo</link>
      <guid>https://forem.com/charmpic/instant-glory-the-app-that-makes-every-coder-a-dev-challenge-winner-1fmo</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/aprilfools-2026"&gt;DEV April Fools Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Ultimate Ego Booster: Challenge Winner Simulator 2026
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Have you ever felt the unbearable emptiness of not winning a DEV challenge? The sleepless nights. The existential dread. The nagging suspicion that your code may not be "useless" enough to qualify for greatness?&lt;/p&gt;

&lt;p&gt;I built a gloriously unnecessary victory machine that solves all of that.&lt;/p&gt;

&lt;p&gt;Challenge Winner Simulator 2026 is a delightfully over-the-top praise engine: enter your name, and the app transforms into a full-blown cosmic celebration of your alleged brilliance. You get dramatic compliments, absurd statistics, galactic proclamations, a cinematic Star Wars-style credit crawl, and enough visual excess to convince any developer that they are, in fact, the chosen one.&lt;/p&gt;

&lt;p&gt;And because the joke simply refused to stay in the browser, I also built a Windows desktop version of the app with Flutter and WebView2. So now the same majestic nonsense can be launched as a native Windows app, packaged like a serious piece of software despite being fundamentally unserious in every possible way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Web demo:&lt;br&gt;&lt;br&gt;
&lt;a href="https://moe-charm.github.io/dev_challenges/20260411winner/index.html" rel="noopener noreferrer"&gt;https://moe-charm.github.io/dev_challenges/20260411winner/index.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Windows release:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/moe-charm/dev_challenges/releases/tag/winner-simulator-20260411" rel="noopener noreferrer"&gt;https://github.com/moe-charm/dev_challenges/releases/tag/winner-simulator-20260411&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Click the "CELEBRATE!" button to trigger the full auditory and visual experience.&lt;br&gt;&lt;br&gt;
And yes, the music absolutely matters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And since this is an April Fools project, you can enjoy the whole thing locally anytime, even offline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Web version:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/moe-charm/dev_challenges/tree/main/20260411winner" rel="noopener noreferrer"&gt;https://github.com/moe-charm/dev_challenges/tree/main/20260411winner&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Windows version:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/moe-charm/dev_challenges/tree/main/winner_simulator_app" rel="noopener noreferrer"&gt;https://github.com/moe-charm/dev_challenges/tree/main/winner_simulator_app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Release build:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/moe-charm/dev_challenges/releases/tag/winner-simulator-20260411" rel="noopener noreferrer"&gt;https://github.com/moe-charm/dev_challenges/releases/tag/winner-simulator-20260411&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;I wanted this to feel both ridiculous and weirdly overengineered, so I kept the web version lightweight while piling on just enough spectacle to make it feel expensive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla HTML/CSS/JS: no framework, no mercy, just pure DOM manipulation and theatrical confidence.&lt;/li&gt;
&lt;li&gt;CSS transforms and animations: used for the big cinematic crawl, dramatic fades, glowing text, and all the unnecessary grandeur.&lt;/li&gt;
&lt;li&gt;Canvas API: used for fireworks and particle effects so the whole thing could sparkle like it was accepting an award nobody asked for.&lt;/li&gt;
&lt;li&gt;Web Audio API: used for fanfares, drum rolls, cat-like sounds, and the kind of BGM that insists your name deserves a standing ovation.&lt;/li&gt;
&lt;li&gt;i18n logic: supports both English and Japanese, because winning should be internationally embarrassing.&lt;/li&gt;
&lt;li&gt;Flutter + WebView2: for the Windows desktop edition, which embeds the same HTML challenge into a standalone app so the joke can live outside the browser too.&lt;/li&gt;
&lt;li&gt;Embedded assets: the Windows build packages the challenge inside the app, so it can be distributed as a release ZIP without needing a separate content folder.&lt;/li&gt;
&lt;/ul&gt;
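&lt;p&gt;For flavor, here is a minimal sketch of the kind of particle update that can drive such Canvas fireworks. This is a hypothetical reconstruction, not the app's actual code: each frame applies velocity, gravity, and alpha decay, and faded sparks are culled before the (omitted) Canvas draw call.&lt;/p&gt;

```javascript
// Hypothetical firework particle physics (the real app's code may differ).
// The Canvas draw step is omitted so the logic runs anywhere.
const GRAVITY = 0.05;

function spawnBurst(x, y, count) {
  const particles = [];
  for (let i = 0; count > i; i++) {
    const angle = (2 * Math.PI * i) / count; // even radial spread
    const speed = 1 + Math.random() * 2;
    particles.push({
      x, y,
      vx: Math.cos(angle) * speed,
      vy: Math.sin(angle) * speed,
      alpha: 1, // fully opaque at birth
    });
  }
  return particles;
}

function step(particles) {
  for (const p of particles) {
    p.x += p.vx;
    p.y += p.vy;
    p.vy += GRAVITY;  // gravity bends each spark downward
    p.alpha -= 0.02;  // fade out over roughly 50 frames
  }
  return particles.filter((p) => p.alpha > 0); // cull dead sparks
}
```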

&lt;p&gt;The whole project is intentionally excessive for something fundamentally useless, which is exactly what made it fun to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prize Category
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Ode to Larry Masinter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project is basically a shrine to the spirit of playful protocol absurdity. It leans hard into the glorious nonsense of &lt;code&gt;418 I'm a teapot&lt;/code&gt;, celebrates the ritual of turning a tiny joke into a grand experience, and fully embraces the idea that the web can be both technically elaborate and completely ridiculous at the same time.&lt;/p&gt;

&lt;p&gt;The Golden Teapot is not just a trophy. It is a philosophy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google AI Usage (Best Google AI Usage Entry)
&lt;/h3&gt;

&lt;p&gt;This entire project was built in a deep pair-programming session with &lt;strong&gt;Antigravity&lt;/strong&gt;, Google’s agentic AI coding assistant. &lt;br&gt;
Antigravity wasn't just a code generator; it acted as a "Dramatic Consultant" and "Vibe Architect." Here’s how we used it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Prototyping&lt;/strong&gt;: Antigravity generated the complex CSS 3D transforms for the credit crawl and the Canvas-based firework engine from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Iteration&lt;/strong&gt;: We iterated on the visual "wow factor" by asking the AI to "make it more over-the-top" and "add more galactic energy," which led to the inclusion of glitch effects, screen shakes, and dynamic starfields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative Writing&lt;/strong&gt;: The AI helped craft the hyperbolic, universe-shattering narratives in the crawl and world reaction sections, ensuring the "uselessness" was presented with the highest possible prestige.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sound Engineering&lt;/strong&gt;: The AI assisted in integrating the Web Audio API for real-time sound synthesis while managing the external BGM integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Platform Escalation&lt;/strong&gt;: When our ambitions outgrew the browser and we decided to build a native offline Windows desktop app with Flutter, the AI (with a strategic assist from ChatGPT on Windows WebView2 virtual hosting) helped us bypass local CORS restrictions and serve the embedded assets natively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The collaboration felt less like "writing code" and more like "directing a digital movie." The AI let me focus on the humor and vision while it handled the heavy lifting of the visual and auditory implementation.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>418challenge</category>
      <category>showdev</category>
    </item>
    <item>
      <title>NyanZip: The Delightfully Useless Cat-Language Compression App</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Fri, 03 Apr 2026 04:44:05 +0000</pubDate>
      <link>https://forem.com/charmpic/nyanzip-the-delightfully-useless-cat-language-compression-app-5gj7</link>
      <guid>https://forem.com/charmpic/nyanzip-the-delightfully-useless-cat-language-compression-app-5gj7</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/aprilfools-2026"&gt;DEV April Fools Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  NyanZip: The Delightfully Useless Cat-Language Compression App
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built NyanZip, a browser-based joke app that takes normal text and turns it into exaggerated cat language.&lt;/p&gt;

&lt;p&gt;It has two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A playful "compress" mode that expands text into a noisy stream of &lt;code&gt;meow&lt;/code&gt; and &lt;code&gt;MEOW!!&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;An "ultra" mode that uses real compression under the hood, but still wraps everything in cat-themed nonsense&lt;/li&gt;
&lt;/ul&gt;
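&lt;p&gt;To make the joke concrete, here is a hypothetical sketch of how such a "compress" mode can stay lossless while expanding the text: map each UTF-8 byte to two of sixteen meow tokens, one per 4-bit nibble. NyanZip's real encoder lives in 20260402april/js/engine.js and may work differently.&lt;/p&gt;

```javascript
// Hypothetical nibble-to-meow codec (NyanZip's actual engine may differ).
// 16 distinct tokens cover one 4-bit nibble each, so every byte becomes
// exactly two tokens and the "compressed" output is gloriously larger.
const MEOWS = ["meow", "Meow", "MEOW", "nyan", "Nyan", "NYAN", "mew", "Mew",
               "MEW", "nya", "Nya", "NYA", "mrrp", "Mrrp", "purr", "Purr"];

function catEncode(text) {
  const bytes = new TextEncoder().encode(text); // string -> UTF-8 bytes
  const tokens = [];
  for (const b of bytes) {
    tokens.push(MEOWS[Math.floor(b / 16)], MEOWS[b % 16]); // high, low nibble
  }
  return tokens.join(" ");
}

function catDecode(catText) {
  const tokens = catText.split(" ");
  const bytes = new Uint8Array(tokens.length / 2);
  for (let i = 0; bytes.length > i; i++) {
    bytes[i] = MEOWS.indexOf(tokens[2 * i]) * 16 + MEOWS.indexOf(tokens[2 * i + 1]);
  }
  return new TextDecoder().decode(bytes); // UTF-8 bytes -> string
}
```

&lt;p&gt;The round trip is lossless for any Unicode input, which keeps a "decompress" button honest.&lt;/p&gt;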

&lt;p&gt;It also includes optional Gemini-powered cat commentary, because every compression tool deserves a tiny, judgmental reviewer in a bow tie.&lt;/p&gt;

&lt;p&gt;The result is intentionally impractical, a little chaotic, and exactly the kind of project that feels right for April Fools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://moe-charm.github.io/dev_challenges/20260402april/" rel="noopener noreferrer"&gt;https://moe-charm.github.io/dev_challenges/20260402april/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/moe-charm/dev_challenges" rel="noopener noreferrer"&gt;https://github.com/moe-charm/dev_challenges&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The code is all in the repository above. The main pieces are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App shell and UI: &lt;code&gt;20260402april/index.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Main interaction logic: &lt;code&gt;20260402april/app.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Cat-text encoder and decoder: &lt;code&gt;20260402april/js/engine.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ultra compression pipeline: &lt;code&gt;20260402april/js/rans.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optional AI cat review feature: &lt;code&gt;20260402april/js/chat.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Localization strings: &lt;code&gt;20260402april/js/i18n.js&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;I built NyanZip with plain HTML, CSS, and JavaScript.&lt;/p&gt;

&lt;p&gt;A few things made it fun to put together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TextEncoder&lt;/code&gt; and &lt;code&gt;TextDecoder&lt;/code&gt; for converting text to bytes and back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CompressionStream&lt;/code&gt; and &lt;code&gt;DecompressionStream&lt;/code&gt; for the ultra mode&lt;/li&gt;
&lt;li&gt;A simple bilingual UI for Japanese and English&lt;/li&gt;
&lt;li&gt;Optional Gemini integration for the cat review comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also published it with GitHub Pages so the joke works directly in the browser, with no setup required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prize Category
&lt;/h2&gt;

&lt;p&gt;I’d submit this for &lt;strong&gt;Community Favorite&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is intentionally silly, easy to try, and built to make people smile first and ask questions later. The optional Google AI feature adds a fun extra layer, but the core joke stands on its own.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>418challenge</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Hakozuna v3.2 Released: Bringing Optimized Memory Allocation to M1 Mac</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Thu, 19 Mar 2026 23:29:20 +0000</pubDate>
      <link>https://forem.com/charmpic/hakozuna-v32-released-bringing-optimized-memory-allocation-to-m1-mac-b7m</link>
      <guid>https://forem.com/charmpic/hakozuna-v32-released-bringing-optimized-memory-allocation-to-m1-mac-b7m</guid>
      <description>&lt;p&gt;I've added the main text to chatgpt5.4 ↓&lt;/p&gt;

&lt;p&gt;I am pleased to announce the release of Hakozuna v3.2.&lt;br&gt;
While my previous update focused on Windows, this release marks a significant milestone: full support for M1 Mac.&lt;/p&gt;

&lt;p&gt;GitHub Release: &lt;a href="https://github.com/hakorune/hakozuna" rel="noopener noreferrer"&gt;https://github.com/hakorune/hakozuna&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zenodo Record: 19120414&lt;/p&gt;

&lt;p&gt;DOI: 10.5281/zenodo.19120414&lt;/p&gt;

&lt;p&gt;What is Hakozuna?&lt;br&gt;
Hakozuna is a memory allocator designed for small objects, built upon the Box Theory framework. It is currently split into two specialized lineages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;hz3&lt;/strong&gt;: optimized for local-heavy / low-RSS workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hz4&lt;/strong&gt;: optimized for remote-heavy / high-thread environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s New in the M1 Mac Update&lt;br&gt;
The primary goal of this update was to establish a seamless workflow on M1 Mac—encompassing development, observation, and running benchmarks for academic papers.&lt;/p&gt;

&lt;p&gt;Key Improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Refined Mac entrypoints&lt;/strong&gt;: all Mac-specific logic is now consolidated in the mac/ directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline separation&lt;/strong&gt;: decoupled the Build Lane and Observe Lane for better modularity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined paper-suite&lt;/strong&gt;: the full suite of benchmarks required for research papers now runs from a single setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparative benchmarking&lt;/strong&gt;: integrated mimalloc and tcmalloc into the suite for direct performance comparisons against hz3 and hz4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance Insights: Where it Shines&lt;br&gt;
Testing the paper-suite on Mac revealed clear strengths for each allocator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hz3 showed dominant performance in the Larson benchmark.&lt;/li&gt;
&lt;li&gt;hz4 took the lead in MT remote (multi-threaded remote free) scenarios.&lt;/li&gt;
&lt;li&gt;In Redis-like workloads, the winner shifted depending on the specific workload characteristics.&lt;/li&gt;
&lt;li&gt;Note on mimalloc-bench: in our subset tests, certain malloc-large treatments were flagged as "no-go."&lt;/li&gt;
&lt;li&gt;Segment registry: for high-remote conditions, slots=32768 yielded better results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Takeaway:&lt;br&gt;
The M1 Mac results reinforce our core philosophy: rather than trying to create a "one-size-fits-all" allocator, partitioning "boxes" based on specific conditions leads to superior efficiency.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
With v3.2, "Mac support" is more than just a port—it is a functional environment ready for rigorous academic benchmarking.&lt;/p&gt;

&lt;p&gt;Summary of Gains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved developer experience (DX) on M1 Mac.&lt;/li&gt;
&lt;li&gt;Automated, reliable comparative benchmarking via the paper-suite.&lt;/li&gt;
&lt;li&gt;Clearer functional boundaries between the hz3 and hz4 lineages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, I plan to utilize this Mac environment to refine the supplementary data and further validate my research for the upcoming paper.&lt;/p&gt;

</description>
      <category>c</category>
      <category>hakozuna</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Porting Hakozuna to Windows Native: Lessons from Benchmarking hz3 and hz4 beyond Ubuntu</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:36:09 +0000</pubDate>
      <link>https://forem.com/charmpic/porting-hakozuna-to-windows-native-lessons-from-benchmarking-hz3-and-hz4-beyond-ubuntu-4mfh</link>
      <guid>https://forem.com/charmpic/porting-hakozuna-to-windows-native-lessons-from-benchmarking-hz3-and-hz4-beyond-ubuntu-4mfh</guid>
      <description>&lt;p&gt;The Windows native support for Hakozuna has finally moved past the "it runs" stage to the "measurable and comparable" stage.&lt;/p&gt;

&lt;p&gt;Previously, my allocator research was focused on Ubuntu. The major milestone here is that the entire pipeline—from source builds to application benchmarks—is now fully operational on Windows.&lt;/p&gt;

&lt;p&gt;The TL;DR: hz3 remains incredibly strong on Windows. hz4 is functional and reproducible, but it hasn't yet consistently outperformed the alternatives in real-world application benchmarks on Windows without specific tuning. Investigation is ongoing.&lt;/p&gt;

&lt;p&gt;What’s New?&lt;br&gt;
This update isn't just about successful compilation. I've established a robust foundation for comparative allocator research on Windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Native comparisons&lt;/strong&gt;: benchmark hz3, hz4, mimalloc, tcmalloc, and the CRT allocator on Windows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-world workloads&lt;/strong&gt;: not just synthetic benchmarks but also real-world Redis and memcached-style loads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;: organized public runners, documentation, and benchmark summary repositories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Publications&lt;/strong&gt;: updated both the Japanese and English versions of the research paper with Windows-specific appendices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distribution&lt;/strong&gt;: updated GitHub Releases, Zenodo, and the public PDFs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Challenges of Windows Porting&lt;br&gt;
Porting to Windows was far from a simple "copy-paste" from Linux. The difficulties lay less in the allocator's hot path and more in the surrounding ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Build toolchains&lt;/strong&gt;: significant differences in build boxes and environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linking nuances&lt;/strong&gt;: handling DLL vs. static link mode variations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OS-specific APIs&lt;/strong&gt;: architecting around VirtualAlloc paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Porting workloads&lt;/strong&gt;: bringing memcached, memtier, and Redis into a native Windows environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixed costs&lt;/strong&gt;: OS-specific fixed costs that were negligible on Linux but prominent on Windows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly, some design choices and default "knobs" that worked perfectly for hz4 on Ubuntu didn't translate into a winning strategy for Windows application benchmarks. This highlights the fascinating—and exhausting—reality of how an allocator's behavior changes depending on the OS.&lt;/p&gt;

&lt;p&gt;Key Benchmark Findings&lt;br&gt;
While the Ubuntu results remain the primary baseline, the Windows native tests revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;hz3 dominance&lt;/strong&gt;: highly performant in real Redis workloads (balanced, kv_only, list_only, highpipe).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workload sensitivity&lt;/strong&gt;: in memcached external-client tests, the "winning" allocator shifts with the specific workload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hz4 potential&lt;/strong&gt;: promising in synthetic benchmarks with specific tuning, but mixed signals in real Redis balanced tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current Verdict:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Default&lt;/strong&gt;: use hz3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Research focus&lt;/strong&gt;: use hz4 for remote-heavy, high-thread-count scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paper and Release Updates&lt;br&gt;
I've synchronized all assets with this release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated Japanese &amp;amp; English PDFs.&lt;/li&gt;
&lt;li&gt;Added Windows-native supplemental tables.&lt;/li&gt;
&lt;li&gt;GitHub Release v3.1 &amp;amp; Zenodo v3.1 (with updated DOI).&lt;/li&gt;
&lt;li&gt;Latest papers available directly in the repo at docs/paper/main_ja.pdf and docs/paper/main_en.pdf.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Personal Insights&lt;br&gt;
The most intriguing discovery was seeing "boxes" (design components) that were unremarkable on Ubuntu suddenly show significant impact on Windows—and vice versa.&lt;/p&gt;

&lt;p&gt;The gap between "performing well in synthetics" and "winning in real apps" is crucial. It’s a stark reminder that in allocator research, looking "fast" on paper matters far less than proving which workload you actually conquer.&lt;/p&gt;

&lt;p&gt;What’s Next?&lt;br&gt;
"Completion" of Windows support actually means reaching a level of maturity where research can truly begin. Moving forward, I plan to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Further optimize hz4 specifically for Windows.&lt;/li&gt;
&lt;li&gt;Refine common profiles and OS-specific configurations for Ubuntu and Windows.&lt;/li&gt;
&lt;li&gt;Improve paper and documentation readability.&lt;/li&gt;
&lt;li&gt;Evolve the "Box Theory" into its next architectural phase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If there’s interest, my next posts will dive deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why I separated hz3 and hz4.&lt;/li&gt;
&lt;li&gt;How to design an allocator using Box Theory.&lt;/li&gt;
&lt;li&gt;The technical nuances of what looks different on Windows compared to Linux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/hakorune/hakozuna" rel="noopener noreferrer"&gt;https://github.com/hakorune/hakozuna&lt;/a&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Does Audio Cable Affect Sound? I Built a Physics Simulator to Find Out.</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sat, 07 Mar 2026 07:54:38 +0000</pubDate>
      <link>https://forem.com/charmpic/does-audio-cable-affect-sound-i-built-a-physics-simulator-to-find-out-4aff</link>
      <guid>https://forem.com/charmpic/does-audio-cable-affect-sound-i-built-a-physics-simulator-to-find-out-4aff</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgn2pfk0rucju4ejo456.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgn2pfk0rucju4ejo456.png" alt=" " width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Discussing whether audio cables change the sound often turns into a never-ending debate based on subjective impressions.&lt;br&gt;
"Long RCA cables make the sound feel muddy."&lt;br&gt;
"Silver wires add a glossy texture to the highs."&lt;br&gt;
"Extremely long speaker cables lose their punch."&lt;/p&gt;

&lt;p&gt;These anecdotes have existed for decades, but personal experience alone settles nothing. Conversely, simple idealized circuit models often fail to explain what audiophiles actually hear.&lt;/p&gt;

&lt;p&gt;That's why I decided to stop looking at the cable in isolation. Instead, I built a simulator that treats the entire signal path as a single physical system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Physical characteristics of the cable&lt;/li&gt;
&lt;li&gt;Interaction with connected equipment&lt;/li&gt;
&lt;li&gt;Amplifier behavior&lt;/li&gt;
&lt;li&gt;Response degradation in the time domain&lt;/li&gt;
&lt;li&gt;Analytical metrics closer to human perception&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Project&lt;br&gt;
GitHub: &lt;a href="https://github.com/moe-charm/audio-chain-physics" rel="noopener noreferrer"&gt;https://github.com/moe-charm/audio-chain-physics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live Demo: &lt;a href="https://audio-chain-physics.streamlit.app/" rel="noopener noreferrer"&gt;https://audio-chain-physics.streamlit.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zenodo DOI: &lt;a href="https://doi.org/10.5281/zenodo.18898657" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.18898657&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I Built&lt;br&gt;
Audio Chain Physics is a research-oriented simulator designed to handle the audio chain in stages. It models the following layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RLGC model of the cable&lt;/strong&gt;: moving beyond simple resistance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interface interaction&lt;/strong&gt;: output impedance, input capacitance, and common return paths.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-linear elements &amp;amp; small-signal stability&lt;/strong&gt;: how the amplifier reacts to the load.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dielectric absorption&lt;/strong&gt;: approximating "trailing" responses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-domain analysis&lt;/strong&gt;: group delay, impulse response, step response, and "TailRatio."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core philosophy is not just analyzing the cable itself, but how the cable changes the operating conditions of the equipment.&lt;/p&gt;

&lt;p&gt;For example, the propagation delay of a 3-meter cable is negligible. It’s too small to be an "audible difference" on its own. However, when you combine Output Impedance + Cable Capacitance + Load Capacitance + Amp Phase Margin + Complex Speaker Load, the settling time, group delay, and ringing change. This synergy is likely what manifests as a perceptible difference in sound.&lt;/p&gt;
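&lt;p&gt;A back-of-the-envelope number illustrates the point. Source output impedance and total cable capacitance form a first-order RC low-pass, and only the combination decides whether its corner lands anywhere near the audio band. The component values below are illustrative assumptions, not figures from the simulator, whose full RLGC model is far more detailed.&lt;/p&gt;

```javascript
// -3 dB corner of a first-order RC low-pass: f_c = 1 / (2 * pi * R * C).
// All component values here are illustrative assumptions.
function cornerFrequencyHz(sourceOhms, capacitanceFarads) {
  return 1 / (2 * Math.PI * sourceOhms * capacitanceFarads);
}

// 100-ohm line output into 3 m of ~100 pF/m interconnect:
// corner in the MHz range, far above audio.
const healthy = cornerFrequencyHz(100, 3 * 100e-12); // ~5.3 MHz

// 10-kilo-ohm passive volume pot into a high-capacitance 10 nF run:
// the corner falls inside the audible band and treble rolls off.
const degraded = cornerFrequencyHz(10e3, 10e-9); // ~1.6 kHz
```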

&lt;p&gt;What the Simulator Reveals (So Far)&lt;br&gt;
The most significant result is that under extreme conditions, sound quality degradation can be clearly reproduced through calculation.&lt;/p&gt;

&lt;p&gt;In scenarios such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long, poor-quality RCA cables&lt;/li&gt;
&lt;li&gt;Excessively long speaker cables&lt;/li&gt;
&lt;li&gt;Amplifiers with high output impedance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simulation clearly shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency attenuation or peaking&lt;/li&gt;
&lt;li&gt;Distorted group delay&lt;/li&gt;
&lt;li&gt;"Tails" in the impulse response&lt;/li&gt;
&lt;li&gt;Ringing in the step response&lt;/li&gt;
&lt;li&gt;A drop in damping factor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirms a solid starting point: If cables or connection conditions are handled poorly, the response of the entire audio chain will degrade.&lt;/p&gt;

&lt;p&gt;What Still Remains Unexplained&lt;br&gt;
To be completely honest, I haven't yet fully explained the more subtle audible differences I’ve experienced myself—those "nuances" that are hard to put into numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sense of "glossiness" or "air" in the highs&lt;/li&gt;
&lt;li&gt;Changes in "soundstage" or "depth"&lt;/li&gt;
&lt;li&gt;The feeling of the sound becoming "thicker" or "denser"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While I can reproduce degradation in extreme cases, I cannot yet claim to have simulated those delicate nuances under "normal" high-end equipment conditions.&lt;/p&gt;

&lt;p&gt;This is my current conclusion: While degradation due to extreme conditions is reproducible, explaining subtle sonic nuances requires further research. This feels like the most honest scientific stance at this stage.&lt;/p&gt;

&lt;p&gt;Future Roadmap&lt;br&gt;
There is still a lot of work to be done. My focus will move toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Validation&lt;/strong&gt;: comparing simulation results with real-world measurement data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex loads&lt;/strong&gt;: supporting measured speaker impedance curves.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stochastic models&lt;/strong&gt;: modeling RF interference, hum, and poor contact points.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controlled listening tests&lt;/strong&gt;: correlating simulation data with subjective perception.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I want to move from "degradation happens at extremes" to "explaining why we hear what we hear in everyday listening."&lt;/p&gt;

&lt;p&gt;If you’re interested in the physics of audio, please check out the GitHub repo or the live demo!&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/moe-charm/audio-chain-physics" rel="noopener noreferrer"&gt;https://github.com/moe-charm/audio-chain-physics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live Demo: &lt;a href="https://audio-chain-physics.streamlit.app/" rel="noopener noreferrer"&gt;https://audio-chain-physics.streamlit.app/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>audio</category>
      <category>python</category>
      <category>simulation</category>
    </item>
    <item>
      <title>Building My Own Image Engine "HakoNyans": I Beat PNG, but WebP is a Wall</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sat, 21 Feb 2026 03:05:12 +0000</pubDate>
      <link>https://forem.com/charmpic/building-my-own-image-engine-hakonyans-i-beat-png-but-webp-is-a-wall-3kpj</link>
      <guid>https://forem.com/charmpic/building-my-own-image-engine-hakonyans-i-beat-png-but-webp-is-a-wall-3kpj</guid>
      <description>&lt;p&gt;Hi DEV community! 👋&lt;/p&gt;

&lt;p&gt;I originally started developing a custom image engine called HakoNyans for a DEV Challenge. The challenge has ended, but my passion hasn't—I'm still actively building and refining it every day.&lt;/p&gt;

&lt;p&gt;The Milestone: Beating PNG&lt;br&gt;
I'm happy to report that HakoNyans has officially surpassed PNG! Getting the architecture and logic right to beat such a classic, widely-used format was a huge milestone for this project.&lt;/p&gt;

&lt;p&gt;The Current Struggle: The WebP Wall&lt;br&gt;
However, my next target is WebP, and let me tell you... WebP is incredibly strong.&lt;/p&gt;

&lt;p&gt;Here is where HakoNyans currently stands against WebP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing Speed: currently 20% slower.&lt;/li&gt;
&lt;li&gt;Compression Ratio: losing by 20% (resulting in larger file sizes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've hit a bit of a wall. Competing with WebP's highly optimized compression logic is no easy task. Closing this 20% gap in both speed and size is my current obsession.&lt;/p&gt;
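&lt;p&gt;One bit of arithmetic worth spelling out (my framing of the numbers above): if output is currently 1.2x the WebP size, closing the gap means shrinking the current output by about 16.7%, not by a full 20%.&lt;/p&gt;

```python
# "Losing by 20%" means current_size = 1.2 * webp_size.
# To match WebP, current output must shrink by 1 - 1/1.2.
ratio_vs_webp = 1.20
needed_shrink_pct = round((1 - 1 / ratio_vs_webp) * 100, 1)
print(needed_shrink_pct)  # 16.7
```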

&lt;p&gt;Let's Discuss!&lt;br&gt;
I'm continually pushing the limits of my current algorithms, but I'd love to hear from the community.&lt;/p&gt;

&lt;p&gt;Has anyone else here tried building an image codec or data compression engine from scratch?&lt;/p&gt;

&lt;p&gt;What optimization techniques or mental models helped you break through performance plateaus?&lt;/p&gt;

&lt;p&gt;Any advice, feedback, or shared experiences would be greatly appreciated. I'll keep pushing HakoNyans forward!&lt;/p&gt;

</description>
      <category>chatgpt</category>
    </item>
    <item>
      <title>4 Months of Developing a Memory Allocator: Updating "Hakozuna" to v3.0 (hz3/hz4)</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Tue, 17 Feb 2026 20:13:31 +0000</pubDate>
      <link>https://forem.com/charmpic/4-months-of-developing-a-memory-allocator-updating-hakozuna-to-v30-hz3hz4-9bb</link>
      <guid>https://forem.com/charmpic/4-months-of-developing-a-memory-allocator-updating-hakozuna-to-v30-hz3hz4-9bb</guid>
      <description>

&lt;h1&gt;
  
  
  4 Months of Developing a Memory Allocator: Updating "Hakozuna" to v3.0 (hz3/hz4)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am excited to announce the release of &lt;strong&gt;Hakozuna&lt;/strong&gt;, a high-performance memory allocator.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/hakorune/hakozuna" rel="noopener noreferrer"&gt;https://github.com/hakorune/hakozuna&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper &amp;amp; Artifacts (Zenodo v3.0):&lt;/strong&gt; &lt;a href="https://zenodo.org/records/18674502" rel="noopener noreferrer"&gt;https://zenodo.org/records/18674502&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over the past four months, I’ve been through countless cycles of implementation and benchmarking, optimizing the performance against industry standards like &lt;code&gt;mimalloc&lt;/code&gt; and &lt;code&gt;tcmalloc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The biggest takeaway from this journey? Instead of trying to create a "one-size-fits-all" configuration to win every race, the real solution was to &lt;strong&gt;branch out into specialized profiles based on use cases.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hz3&lt;/code&gt;&lt;/strong&gt;: Optimized for &lt;strong&gt;local-heavy / Redis-like&lt;/strong&gt; workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hz4&lt;/code&gt;&lt;/strong&gt;: Optimized for &lt;strong&gt;remote-heavy / high-thread&lt;/strong&gt; workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Hakozuna?
&lt;/h2&gt;

&lt;p&gt;Hakozuna is built on &lt;strong&gt;Box Theory&lt;/strong&gt;—a design philosophy centered on aggregating boundaries to isolate responsibilities. During development, I prioritized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero overhead in the hot path:&lt;/strong&gt; Eliminating unnecessary operations where it matters most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reversibility:&lt;/strong&gt; Ensuring every optimization can be toggled via compile flags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Using A/B benchmarking and one-shot counters to make performance data transparent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmark Summary (Ubuntu Native, Representative Values at RUNS=10)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MT lane x remote% (Ops/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;&lt;code&gt;hz3&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;hz4&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;mimalloc&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;tcmalloc&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;main_r0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;375.4M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137.4M&lt;/td&gt;
&lt;td&gt;224.2M&lt;/td&gt;
&lt;td&gt;232.7M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;main_r50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;66.5M&lt;/td&gt;
&lt;td&gt;78.1M&lt;/td&gt;
&lt;td&gt;17.9M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.3M&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;main_r90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62.6M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;67.6M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.0M&lt;/td&gt;
&lt;td&gt;54.9M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cross128_r90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.80M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.65M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.94M&lt;/td&gt;
&lt;td&gt;7.50M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Redis-like (Median ops/s)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hz3&lt;/code&gt;&lt;/strong&gt;: &lt;strong&gt;571,199&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mimalloc&lt;/code&gt;&lt;/strong&gt;: 568,740&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tcmalloc&lt;/code&gt;&lt;/strong&gt;: 568,052&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hz4&lt;/code&gt;&lt;/strong&gt;: 560,576&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Choosing the Right Version
&lt;/h2&gt;

&lt;p&gt;For most scenarios, I recommend starting with &lt;strong&gt;&lt;code&gt;hz3&lt;/code&gt;&lt;/strong&gt; as the default. Switch to &lt;strong&gt;&lt;code&gt;hz4&lt;/code&gt;&lt;/strong&gt; only if your workload is strictly remote-heavy or involves extremely high thread counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# hz3 (Default)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;hakozuna/hz3
make clean all_ldpreload_scale
&lt;span class="nv"&gt;LD_PRELOAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./libhakozuna_hz3_scale.so ./your_app

&lt;span class="c"&gt;# hz4 (Remote-heavy / High-thread)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../hz4
make clean all
&lt;span class="nv"&gt;LD_PRELOAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./libhakozuna_hz4.so ./your_app

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Through this development process, I gained two major insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify "NO-GOs" Early&lt;/strong&gt;
Documenting and archiving optimizations that &lt;em&gt;didn't&lt;/em&gt; work was just as important as the successes. Moving on quickly from failed paths ultimately accelerated the final progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;There is no single "winning" path&lt;/strong&gt;
Stability in performance numbers only came after separating the logic: &lt;code&gt;hz3&lt;/code&gt; for local-heavy and &lt;code&gt;hz4&lt;/code&gt; for remote-heavy/high-thread counts. Specialization is the key to outperforming general-purpose allocators in specific niches.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/hakorune/hakozuna" rel="noopener noreferrer"&gt;hakorune/hakozuna&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zenodo v3.0:&lt;/strong&gt; &lt;a href="https://zenodo.org/records/18674502" rel="noopener noreferrer"&gt;View Records&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOI:&lt;/strong&gt; 10.5281/zenodo.18674502&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>c</category>
      <category>codex</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>HakoNyans: A Transparent Lossless Codec Challenge (with GitHub Copilot CLI)</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sat, 14 Feb 2026 17:11:27 +0000</pubDate>
      <link>https://forem.com/charmpic/hakonyans-a-transparent-lossless-codec-challenge-with-github-copilot-cli-2imj</link>
      <guid>https://forem.com/charmpic/hakonyans-a-transparent-lossless-codec-challenge-with-github-copilot-cli-2imj</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-01-21"&gt;GitHub Copilot CLI Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;HakoNyans&lt;/strong&gt;, an experimental image codec focused on practical decode speed and transparent lossless results&lt;br&gt;
  across different image types (photo, anime, UI/screen).&lt;/p&gt;

&lt;p&gt;For this challenge, I focused on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A clearer lossless workflow in CLI&lt;/strong&gt;&lt;br&gt;
Added a new command: &lt;code&gt;hakonyans encode-lossless &amp;lt;in.ppm&amp;gt; &amp;lt;out.hkn&amp;gt; [preset: fast|balanced|max]&lt;/code&gt;&lt;br&gt;
This makes lossless testing reproducible from the terminal without custom scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible visual and metric snapshots&lt;/strong&gt;&lt;br&gt;
Prepared a dedicated asset pack for challenge demos: &lt;code&gt;docs/assets/devchallenge_2026_01_21/&lt;/code&gt;&lt;br&gt;
Included side-by-side comparisons and both &lt;strong&gt;win&lt;/strong&gt; and &lt;strong&gt;lose&lt;/strong&gt; cases (to keep reporting honest).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What this project means to me:&lt;/p&gt;

&lt;p&gt;I wanted to show not just “best-case screenshots,” but a realistic engineering snapshot: where HKN already wins (some natural-photo cases), and where PNG is still much stronger (some structured/UI-like cases).&lt;/p&gt;

&lt;h2&gt;Demo&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/hakorune/HakoNyans" rel="noopener noreferrer"&gt;https://github.com/hakorune/HakoNyans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Challenge asset pack: &lt;a href="https://github.com/hakorune/HakoNyans/tree/main/docs/assets/devchallenge_2026_01_21" rel="noopener noreferrer"&gt;https://github.com/hakorune/HakoNyans/tree/main/docs/assets/devchallenge_2026_01_21&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Screenshots&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Main comparison&lt;/strong&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flinerwry58cyhywxlham.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flinerwry58cyhywxlham.jpg" alt="Artoria side-by-side compare" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main comparison metrics (Artoria, lossless)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source PNG (Artoria): &lt;code&gt;17,669,320 bytes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Env: &lt;code&gt;HAKONYANS_THREADS=1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Command: &lt;code&gt;./build/hakonyans encode-lossless &amp;lt;in.ppm&amp;gt; &amp;lt;out.hkn&amp;gt; &amp;lt;preset&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Encode time: 2-run median wall time&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;HKN bytes&lt;/th&gt;
&lt;th&gt;PNG bytes&lt;/th&gt;
&lt;th&gt;PNG/HKN&lt;/th&gt;
&lt;th&gt;Encode time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;balanced&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;17,374,550&lt;/td&gt;
&lt;td&gt;17,669,320&lt;/td&gt;
&lt;td&gt;1.0170&lt;/td&gt;
&lt;td&gt;2.94 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16,767,516&lt;/td&gt;
&lt;td&gt;17,669,320&lt;/td&gt;
&lt;td&gt;1.0538&lt;/td&gt;
&lt;td&gt;101.18 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;PNG/HKN &amp;gt; 1.0&lt;/code&gt; means HKN is smaller.&lt;/p&gt;
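&lt;p&gt;The ratio column can be recomputed directly from the raw byte counts in the table (a quick sanity check, nothing more):&lt;/p&gt;

```python
# Recomputing PNG/HKN from the byte counts reported above.
png_bytes = 17_669_320
hkn_balanced = 17_374_550
hkn_max = 16_767_516

ratio_balanced = round(png_bytes / hkn_balanced, 4)  # ~1.017
ratio_max = round(png_bytes / hkn_max, 4)            # ~1.0538
print(ratio_balanced, ratio_max)
```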

&lt;p&gt;&lt;strong&gt;Lossless win case (nature_01)&lt;/strong&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmfj47tsy0jd6txnx5h3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmfj47tsy0jd6txnx5h3.jpg" alt="Lossless win case" width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lossless lose case (hd_01)&lt;/strong&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0egq5apvf0pezi0hwp1k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0egq5apvf0pezi0hwp1k.jpg" alt="Lossless lose case" width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Example metrics snapshot (fixed6, max preset)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nature_01&lt;/code&gt;: HKN 812,567 bytes vs PNG 1,281,481 bytes (&lt;code&gt;PNG/HKN = 1.577&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nature_02&lt;/code&gt;: HKN 999,685 bytes vs PNG 1,446,470 bytes (&lt;code&gt;PNG/HKN = 1.447&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hd_01&lt;/code&gt;: HKN 699,858 bytes vs PNG 8,785 bytes (&lt;code&gt;PNG/HKN = 0.0126&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So currently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HKN can beat PNG on some photo-like content.&lt;/li&gt;
&lt;li&gt;PNG still dominates some structured/worst-case images.&lt;/li&gt;
&lt;li&gt;The project is actively improving both sides.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;My Experience with GitHub Copilot CLI&lt;/h2&gt;

&lt;p&gt;GitHub Copilot CLI was most useful for my fast implementation loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refactoring large headers into smaller units safely&lt;/li&gt;
&lt;li&gt;Adding a new CLI command (&lt;code&gt;encode-lossless&lt;/code&gt;) while preserving existing behavior&lt;/li&gt;
&lt;li&gt;Running repetitive verify loops quickly (&lt;code&gt;build&lt;/code&gt;, &lt;code&gt;ctest&lt;/code&gt;, benchmark checks)&lt;/li&gt;
&lt;li&gt;Generating and validating challenge-ready visual assets from command pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest impact was &lt;strong&gt;speed + consistency&lt;/strong&gt;.&lt;br&gt;
  I could iterate quickly in terminal-first workflows while keeping changes verifiable (tests, checksums, RMSE checks,&lt;br&gt;
  benchmark CSVs).&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>cli</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>AVX2 SIMD Optimization for 12-bit JPEG Decoding in libjpeg-turbo — Pair Programming with Copilot CLI</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Tue, 10 Feb 2026 15:49:32 +0000</pubDate>
      <link>https://forem.com/charmpic/avx2-simd-optimization-for-12-bit-jpeg-decoding-in-libjpeg-turbo-pair-programming-with-copilot-cli-3o37</link>
      <guid>https://forem.com/charmpic/avx2-simd-optimization-for-12-bit-jpeg-decoding-in-libjpeg-turbo-pair-programming-with-copilot-cli-3o37</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-01-21"&gt;GitHub Copilot CLI Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;I added &lt;strong&gt;AVX2 SIMD optimizations&lt;/strong&gt; to libjpeg-turbo's 12-bit JPEG decoding pipeline, achieving &lt;strong&gt;4.6% speedup&lt;/strong&gt; on&lt;br&gt;
  Full HD and &lt;strong&gt;2.5% on 4K&lt;/strong&gt; images.&lt;/p&gt;

&lt;p&gt;libjpeg-turbo is the world's most widely used JPEG library, with highly optimized SIMD paths for 8-bit JPEG. However,&lt;br&gt;
  &lt;strong&gt;12-bit JPEG&lt;/strong&gt; (used in medical imaging and high-precision workflows) had &lt;strong&gt;zero SIMD support&lt;/strong&gt; — everything ran as&lt;br&gt;
  scalar C code.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;perf&lt;/code&gt; profiling, I identified &lt;strong&gt;3 hotspots&lt;/strong&gt; and implemented AVX2 intrinsics for each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDCT&lt;/strong&gt; (Inverse DCT)&lt;/td&gt;
&lt;td&gt;64-bit arithmetic + AVX2 parallelization&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;YCC→RGB Color Conversion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SIMD compute + packed RGB interleave output&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H2V2 Fancy Upsample&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16-bit SIMD weighted interpolation&lt;/td&gt;
&lt;td&gt;~1.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Why "just 4.6%" matters&lt;/h3&gt;

&lt;p&gt;libjpeg-turbo is &lt;strong&gt;already one of the most optimized codebases in existence&lt;/strong&gt;. Profiling reveals that &lt;strong&gt;37.6% of CPU&lt;br&gt;
  time&lt;/strong&gt; is spent in Huffman decoding — which is structurally impossible to SIMD-ize due to the sequential&lt;br&gt;
  bit-dependency in the JPEG spec. The SIMD-able portion (IDCT + color conversion + upsampling ≈ 44%) was effectively&lt;br&gt;
  optimized across all three targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📊 Benchmarks (AMD Ryzen 9 9950X, GCC 13.3.0, -O3)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full HD (1920×1080)&lt;/td&gt;
&lt;td&gt;27.87 ms&lt;/td&gt;
&lt;td&gt;26.58 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K (3840×2160)&lt;/td&gt;
&lt;td&gt;113.07 ms&lt;/td&gt;
&lt;td&gt;110.26 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;🧪 All 662 tests pass&lt;/strong&gt; — JPEG compliance tests allow zero tolerance for bit-level differences.&lt;/p&gt;

&lt;h2&gt;Demo&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🔗 Repository&lt;/strong&gt;: moe-charm/dev_libjpeg-turbo-12bit-simd&lt;/p&gt;

&lt;h3&gt;Profiling Breakdown (4K 12-bit JPEG)&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;37.63%  decode_mcu                    ← Huffman decoding (cannot SIMD)
21.95%  jsimd_idct_islow_avx2_12bit   ← ✅ AVX2 optimized
11.38%  ycc_rgb_convert               ← ✅ AVX2 optimized
10.57%  h2v2_fancy_upsample           ← ✅ AVX2 optimized
 8.25%  put_rgb                       ← File I/O
 5.00%  jpeg_fill_bit_buffer          ← Bitstream parsing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Key Implementation Detail&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// 12-bit samples are 16-bit → widen to 32-bit for arithmetic → pack back to 16-bit
__m256i y = _mm256_cvtepu16_epi32(_mm_loadu_si128((__m128i *)inptr0));
// ... AVX2 YCC→RGB conversion ...
__m256i r16 = _mm256_packus_epi32(r, zero);  // 32-bit → 16-bit pack
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;My Experience with GitHub Copilot CLI&lt;/h2&gt;

&lt;p&gt;This entire project was built &lt;strong&gt;exclusively through Copilot CLI in the terminal&lt;/strong&gt; — no IDE involved.&lt;/p&gt;

&lt;h3&gt;🔍 The Profile → Implement → Verify Cycle&lt;/h3&gt;

&lt;p&gt;Copilot CLI handled &lt;code&gt;perf record&lt;/code&gt; / &lt;code&gt;perf report&lt;/code&gt; execution and analysis, AVX2 intrinsics code generation, and running&lt;br&gt;
   all 662 &lt;code&gt;ctest&lt;/code&gt; tests — &lt;strong&gt;all within a single terminal session&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What stood out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profiling-driven prioritization&lt;/strong&gt; — After running &lt;code&gt;perf&lt;/code&gt;, Copilot analyzed the results and suggested which
function to optimize next based on CPU time share. This data-driven approach kept the work focused on high-impact
targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVX2 intrinsics generation&lt;/strong&gt; — Instructions like &lt;code&gt;_mm256_packus_epi32&lt;/code&gt;, &lt;code&gt;_mm256_permute4x64_epi64&lt;/code&gt;, and
&lt;code&gt;_mm256_cvtepu16_epi32&lt;/code&gt; are notoriously hard to get right without reading Intel manuals. Copilot generated correct
sequences and understood the cross-lane behavior of AVX2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging bit-level failures&lt;/strong&gt; — The 12-bit IDCT initially had 1-bit rounding errors that failed compliance tests.
Copilot helped diagnose the overflow issue and switch from 32-bit to 64-bit intermediate arithmetic to fix it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing infrastructure&lt;/strong&gt; — Copilot proposed and implemented the &lt;code&gt;JPEG12_IDCT_FORCE_C&lt;/code&gt; environment variable for
toggling between SIMD and scalar paths, enabling clean before/after benchmarking.&lt;/li&gt;
&lt;/ul&gt;
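&lt;p&gt;The overflow behind those 1-bit rounding failures is easy to see in integer arithmetic (a toy reconstruction, not the actual libjpeg-turbo code): a maximum 12-bit sample, once pre-scaled for fixed-point math, exceeds the signed 32-bit range after a single constant multiply.&lt;/p&gt;

```python
# Toy illustration: max 12-bit sample * Q13 constant after fixed-point
# pre-scaling overflows a signed 32-bit intermediate.
sample = 4095                 # maximum 12-bit value
scaled = sample << 13         # typical fixed-point pre-scaling
const_q13 = 11585             # e.g. sqrt(2) in Q13 fixed point
product = scaled * const_q13

print(product > 2**31 - 1)    # exceeds INT32_MAX -> 64-bit intermediate needed
```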

&lt;h3&gt;💡 Why CLI Was Perfect for This&lt;/h3&gt;

&lt;p&gt;SIMD optimization lives and dies by the &lt;strong&gt;"write → build → test → profile → analyze → rewrite"&lt;/strong&gt; loop. Copilot CLI&lt;br&gt;
  keeps this cycle &lt;strong&gt;entirely within the terminal&lt;/strong&gt; — no context switching to an editor. Run &lt;code&gt;cmake --build&lt;/code&gt;, see the&lt;br&gt;
  result, fix the code, run 662 tests, benchmark — all in one continuous conversation.&lt;/p&gt;

&lt;h3&gt;⚠️ Challenges&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rare type system&lt;/strong&gt;: libjpeg-turbo's 12-bit types (&lt;code&gt;J12SAMPLE&lt;/code&gt;, &lt;code&gt;J12SAMPROW&lt;/code&gt;, &lt;code&gt;J12SAMPARRAY&lt;/code&gt;) barely exist in
training data. Copilot initially generated dispatch logic using the compile-time &lt;code&gt;BITS_IN_JSAMPLE&lt;/code&gt; macro, but the
correct approach requires runtime &lt;code&gt;data_precision&lt;/code&gt; checks — since libjpeg-turbo builds a single binary supporting
multiple precisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measuring small gains&lt;/strong&gt;: When your baseline is already world-class, proving that a 2-3% improvement is real (not
noise) requires careful benchmark design with multiple runs and statistical analysis.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with GitHub Copilot CLI + libjpeg-turbo 3.1.x on AMD Ryzen 9 9950X / Ubuntu / GCC 13.3.0&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>cli</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>I suffered a crushing defeat against mimalloc in specific tuning scenarios.</title>
      <dc:creator>CharmPic</dc:creator>
      <pubDate>Sun, 25 Jan 2026 05:42:49 +0000</pubDate>
      <link>https://forem.com/charmpic/i-suffered-a-crushing-defeat-against-mimalloc-in-specific-tuning-scenarios-4dj3</link>
      <guid>https://forem.com/charmpic/i-suffered-a-crushing-defeat-against-mimalloc-in-specific-tuning-scenarios-4dj3</guid>
<description>&lt;ol&gt;
&lt;li&gt;The Scalability Wall (T=16)&lt;br&gt;
At 16 threads, hz3 hit a limit.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;hz3: 76.6M ops/s&lt;/li&gt;
&lt;li&gt;mimalloc: 85.0M ops/s&lt;/li&gt;
&lt;li&gt;hz4 (AI): 106.3M ops/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My allocator struggles to scale linearly at high thread counts compared to the AI's brute-force approach.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The "Aggressive Purge" Defeat&lt;br&gt;
In the default benchmarks, hz3 used less memory (1.36GB) than mimalloc (1.52GB). However, when I enabled mimalloc's aggressive memory release mode (purge_delay=0), the tables turned dramatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;mimalloc (tuned): 0.52 GB 😱&lt;/li&gt;
&lt;li&gt;hz3 (tuned): 1.39 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: hz3 used 2.7x more memory.&lt;/p&gt;

&lt;p&gt;mimalloc has a highly sophisticated page reclaiming system that I haven't implemented yet. While hz3 holds onto memory to keep speed up, mimalloc can strip down to the bare metal when asked. This is a clear victory for mimalloc's engineering maturity.&lt;/p&gt;
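&lt;p&gt;The tradeoff is easy to caricature in a few lines. This is my own toy, count-based stand-in for mimalloc's time-based purge_delay, not its actual reclaiming logic: freed pages either stay cached for fast reuse (higher RSS) or get handed back to the OS (lower RSS, more syscalls).&lt;/p&gt;

```python
# Toy purge model: every free adds a cached page; once the cache exceeds
# `purge_delay` pages, everything cached is returned to the OS in one call.
def simulate_frees(n_frees, purge_delay):
    cached = returned = syscalls = 0
    for _ in range(n_frees):
        cached += 1
        if cached > purge_delay:      # purge: hand pages back to the OS
            returned += cached
            cached = 0
            syscalls += 1
    return cached, returned, syscalls

print(simulate_frees(100, 0))     # aggressive purge: minimal RSS, many syscalls
print(simulate_frees(100, 1000))  # lazy: every page retained for speed
```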

</description>
      <category>cpp</category>
    </item>
  </channel>
</rss>
