<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Wayne</title>
    <description>The latest articles on Forem by Wayne (@wheynelau).</description>
    <link>https://forem.com/wheynelau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898242%2F77ad6a26-606a-4f53-a83c-55494768faf9.jpeg</url>
      <title>Forem: Wayne</title>
      <link>https://forem.com/wheynelau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wheynelau"/>
    <language>en</language>
    <item>
      <title>Ansible at Home</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Mon, 27 Apr 2026 05:17:16 +0000</pubDate>
      <link>https://forem.com/wheynelau/ansible-at-home-1ig5</link>
      <guid>https://forem.com/wheynelau/ansible-at-home-1ig5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;While some engineering concepts should not be brought home, I find that Ansible is one of the few tools that is genuinely useful in a home environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Ansible?
&lt;/h2&gt;

&lt;p&gt;Ansible is a powerful automation tool that helps manage and configure systems efficiently. Another important aspect that is often overlooked is documentation. In the past, I would SSH into my home server and make changes directly; if I remembered to document them, I would save code snippets into a README.md or Obsidian note. This approach is prone to human error and leads to inconsistencies over time. Most IaC (Infrastructure as Code) tools are self-documenting, as the code itself serves as the documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup and Configuration
&lt;/h2&gt;

&lt;p&gt;Before diving into use cases, it's important to set up Ansible properly for a home environment. The configuration is straightforward and makes running playbooks much more convenient.&lt;/p&gt;

&lt;p&gt;I keep two files in the project directory: an &lt;code&gt;ansible.cfg&lt;/code&gt; pointing to my inventory file and enabling &lt;code&gt;become_ask_pass&lt;/code&gt; so it prompts for sudo passwords rather than storing credentials (security first, even at home), and an &lt;code&gt;inventory.ini&lt;/code&gt; with at least &lt;code&gt;localhost ansible_connection=local&lt;/code&gt; so playbooks run locally without SSH overhead. With those in place, &lt;code&gt;ansible-playbook playbook.yml&lt;/code&gt; just works.&lt;/p&gt;
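
&lt;p&gt;For reference, a minimal version of those two files looks something like this (the exact contents are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;# ansible.cfg
[defaults]
inventory = inventory.ini

[privilege_escalation]
become_ask_pass = True

# inventory.ini
localhost ansible_connection=local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
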

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  System configuration
&lt;/h3&gt;

&lt;p&gt;I am using a consumer Intel CPU with a stock cooler for my homelab. Since I don't expect it to run heavy workloads, I don't need it to run at full power. I set the PL1 and PL2 power limits through Ansible rather than using the &lt;code&gt;intel-undervolt&lt;/code&gt; tool. This way, if I ever need to reinstall the OS or set up a new server, I can apply the same configuration without having to remember the exact commands or settings.&lt;/p&gt;

&lt;p&gt;The playbook validates that PL1 and PL2 values fall within acceptable ranges (hard lower/upper limits) and ensures PL2 &amp;gt;= PL1 before applying them. It then writes to the sysfs powercap interfaces to set sustained and burst power limits, and creates a systemd service for persistence across reboots.&lt;/p&gt;

&lt;p&gt;It's easy to make mistakes when setting raw power values -- adding one extra zero can be disastrous. With Ansible, I can specify values like 65W and 90W instead of 65000000 and 90000000, and the validation layer catches out-of-range inputs before they get applied.&lt;/p&gt;
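
&lt;p&gt;As a rough sketch of what those tasks can look like (the variable names, limits, and RAPL zone index are illustrative, not my exact playbook):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Validate requested power limits
  ansible.builtin.assert:
    that:
      - pl1_watts | int &amp;gt;= 15
      - pl2_watts | int &amp;lt;= 125
      - pl2_watts | int &amp;gt;= pl1_watts | int
    fail_msg: "PL1/PL2 out of range, or PL2 below PL1"

- name: Apply PL1 (sustained) limit via the powercap interface
  ansible.builtin.shell:
    cmd: echo {{ (pl1_watts | int) * 1000000 }} &amp;gt; /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
  become: true

- name: Apply PL2 (burst) limit via the powercap interface
  ansible.builtin.shell:
    cmd: echo {{ (pl2_watts | int) * 1000000 }} &amp;gt; /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw
  become: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
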

&lt;h3&gt;
  
  
  Restic setup
&lt;/h3&gt;

&lt;p&gt;Restic is a great backup tool that can back up data to various locations. The backup scripts are written by hand, but the cron jobs and log rotation are managed by Ansible. Doing the process below manually would be prone to human error and inconsistency, since multiple files are involved: the cron jobs, the log rotation configuration, and the backup scripts themselves.&lt;/p&gt;

&lt;p&gt;The playbook ensures restic is installed, makes the backup scripts executable, sets up two daily cron jobs (one for immich data at 2 AM, one for documents at 3 AM), and configures logrotate to keep 21 days of compressed logs with daily rotation.&lt;/p&gt;
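
&lt;p&gt;A trimmed-down sketch of those tasks -- the script paths, schedules, and log locations are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Nightly Immich backup at 2 AM
  ansible.builtin.cron:
    name: restic-immich-backup
    minute: "0"
    hour: "2"
    job: /opt/backup/restic-immich.sh &amp;gt;&amp;gt; /var/log/restic/immich.log 2&amp;gt;&amp;amp;1

- name: Rotate restic logs daily, keeping 21 days compressed
  ansible.builtin.copy:
    dest: /etc/logrotate.d/restic
    content: |
      /var/log/restic/*.log {
          daily
          rotate 21
          compress
          missingok
          notifempty
      }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
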

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you add more services and configurations to your home environment, the benefits of using Ansible become even more apparent. It helps maintain consistency, reduces the risk of human error, and serves as documentation for your setup. Whether you're managing a single server or multiple devices, Ansible can streamline your home automation tasks effectively.&lt;/p&gt;

&lt;p&gt;The full version with the complete playbook examples is on &lt;a href="https://wheynelau.dev/posts/2025-08-12-ansible-at-home/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ansible</category>
      <category>homelab</category>
      <category>iac</category>
    </item>
    <item>
      <title>Making Compression a Habit with zstd</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:47:27 +0000</pubDate>
      <link>https://forem.com/wheynelau/making-compression-a-habit-with-zstd-2gie</link>
      <guid>https://forem.com/wheynelau/making-compression-a-habit-with-zstd-2gie</guid>
      <description>&lt;p&gt;With zstd being added to &lt;a href="https://docs.python.org/3/library/compression.zstd.html" rel="noopener noreferrer"&gt;Python 3.14&lt;/a&gt;, I've been using compressed files more often in my workflow. Here's what I've learned about making compression a habit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Data Processing with Compression
&lt;/h2&gt;

&lt;p&gt;Python 3.14 adds native &lt;code&gt;zstd.open()&lt;/code&gt; support, which is a big step forward. Here's the comparison:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before 3.14&lt;/strong&gt; (with &lt;code&gt;zstandard&lt;/code&gt; package):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zstandard&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;

&lt;span class="c1"&gt;# Writing compressed JSONL with Zstandard
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Write
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ZstdCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;cctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ZstdDecompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextIOWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python 3.14+&lt;/strong&gt; is much simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Read and print first record
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# Remove break to read all lines
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API mirrors regular &lt;code&gt;open()&lt;/code&gt; -- just use &lt;code&gt;zstd.open()&lt;/code&gt; instead.&lt;/p&gt;
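
&lt;p&gt;Writing works the same way. A minimal sketch, reusing the records from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from compression import zstd
import json

records = [{"id": 1, "name": "Alice", "score": 95}]

# 'wt' opens the compressed file in text mode; level 3 matches the earlier example
with zstd.open('data.jsonl.zst', 'wt', level=3) as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
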

&lt;p&gt;&lt;strong&gt;Key points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;'wt'&lt;/code&gt; mode for writing text, &lt;code&gt;'rt'&lt;/code&gt; for reading&lt;/li&gt;
&lt;li&gt;Typical compression ratio: 6-7x size reduction at &lt;code&gt;zstd-3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmarking Your Workload
&lt;/h2&gt;

&lt;p&gt;You should benchmark compression according to your workload to determine your trade-offs.&lt;/p&gt;

&lt;p&gt;For archival of logs or long-term storage, you can use higher compression levels of &lt;code&gt;zstd&lt;/code&gt;. Archives like Pushshift Reddit typically use level 22. For most use cases, &lt;code&gt;zstd-3&lt;/code&gt; is a good default.&lt;/p&gt;
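
&lt;p&gt;The &lt;code&gt;zstd&lt;/code&gt; CLI has a built-in benchmark mode, which makes it easy to compare levels on a representative sample of your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Benchmark compression levels 3 through 19 on a sample file
zstd -b3 -e19 sample.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
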

&lt;h2&gt;
  
  
  Working with Compressed Files
&lt;/h2&gt;

&lt;p&gt;Zstd includes tools for viewing, searching, and processing compressed files without manual decompression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick commands:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;zstdcat data.json.zst&lt;/code&gt; -- view the contents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdless data.json.zst&lt;/code&gt; -- page through like &lt;code&gt;less&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdgrep "error" events.json.zst&lt;/code&gt; -- search inside compressed files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdgrep -c "timeout" events.json.zst&lt;/code&gt; -- count occurrences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also pipe to other tools: &lt;code&gt;zstdcat events.json.zst | grep ERROR | jq '.timestamp'&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transferring files
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rsync
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;-z&lt;/code&gt; flag compresses data during transfer. On highly compressible files, rsync may report &lt;code&gt;speedup &amp;gt; 1.0x&lt;/code&gt;. Here's a test with about 66GB of JSONL files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent 50,322 bytes  received 12,451,737,167 bytes  19,290,143.28 bytes/sec
total size is 66,857,841,487  speedup is 5.37
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That speedup means the data sent over the network was much smaller than the original file size. This is highly beneficial if you're network bound or concerned about egress costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 and Cloud Storage
&lt;/h3&gt;

&lt;p&gt;AWS charges for outbound data transfer (egress). Compressing data before storage can significantly reduce these costs. With a 7.0x compression ratio, a $14,000 egress bill drops to roughly $2,000.&lt;/p&gt;

&lt;p&gt;Here's an upload comparison on a gigabit connection with a 4GB JSONL file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compressed upload (zstd -k -c ... | s5cmd pipe)&lt;/span&gt;
real    0m8.139s
&lt;span class="c"&gt;# Result in S3: 363.7MB&lt;/span&gt;

&lt;span class="c"&gt;# Uncompressed upload (s5cmd cp)&lt;/span&gt;
real    0m57.547s
&lt;span class="c"&gt;# Result in S3: 4.0GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same principle works for downloading: &lt;code&gt;s5cmd cat s3://bucket/data.zst | zstd -d &amp;gt; data.jsonl&lt;/code&gt;. Compression takes longer than decompression, but the speedup is usually worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I use &lt;code&gt;zstdcat&lt;/code&gt; to read files and rarely need to edit them in an IDE. This habit cut my text storage by up to 80%. There's a balance between convenience, speed, and storage, and this works for me. More compact formats like Protobuf or Arrow exist, but most text processing still uses JSON.&lt;/p&gt;

&lt;p&gt;The full version with code examples and benchmarks is on &lt;a href="https://wheynelau.dev/posts/compression-with-ztsd/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>linux</category>
      <category>performance</category>
      <category>compression</category>
    </item>
    <item>
      <title>Using hf tokenizers in Rust</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:43:42 +0000</pubDate>
      <link>https://forem.com/wheynelau/using-hf-tokenizers-in-rust-1k5p</link>
      <guid>https://forem.com/wheynelau/using-hf-tokenizers-in-rust-1k5p</guid>
      <description>&lt;p&gt;The &lt;code&gt;tokenizers&lt;/code&gt; library from Hugging Face provides an efficient way to work with text tokenization in Rust. This guide shows you how to get started with pretrained tokenizers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;First, add the tokenizer library to your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo add tokenizers &lt;span class="nt"&gt;--features&lt;/span&gt; http,hf-hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Usage
&lt;/h2&gt;

&lt;p&gt;Here's a complete example that loads a pretrained tokenizer and processes text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Load a pretrained tokenizer&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"This is a sample string to tokenize"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Encode the text (false = no special tokens)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="nf"&gt;.encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Get token IDs&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="nf"&gt;.get_ids&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Token IDs: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Get token text&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="nf"&gt;.get_tokens&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Tokens: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Original: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of tokens: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="nf"&gt;.decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Original: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Decoded: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Working with Different Models
&lt;/h2&gt;

&lt;p&gt;You can use various pretrained models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// GPT-2 tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;gpt_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gpt2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// BERT tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;bert_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bert-base-uncased"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Llama tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llama_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;To change the cache directory for downloaded models, set the &lt;code&gt;HF_HOME&lt;/code&gt; environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting environment variables programmatically is not recommended as it requires an unsafe block. &lt;/p&gt;

&lt;h3&gt;
  
  
  Private Repositories
&lt;/h3&gt;

&lt;p&gt;If you encounter this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error: RequestError&lt;span class="o"&gt;(&lt;/span&gt;Status&lt;span class="o"&gt;(&lt;/span&gt;401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-12b-it/resolve/main/tokenizer.json]&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It means you are not authenticated and need to provide a token. There are two ways to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write your token to &lt;code&gt;$HF_HOME/token&lt;/code&gt; (&lt;code&gt;$HF_HOME&lt;/code&gt; usually defaults to &lt;code&gt;$HOME/.cache/huggingface&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Within Rust code:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;your very secret token&amp;gt;"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"google/gemma-3-4b-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note that you may still need to get permission to access the repos.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Branches
&lt;/h3&gt;

&lt;p&gt;You can specify a specific branch or revision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;revision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"main"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;// or specific commit hash&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"google/gemma-3-4b-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User-Agent
&lt;/h3&gt;

&lt;p&gt;The parameters also include a &lt;code&gt;user_agent&lt;/code&gt; field for customizing the HTTP client's User-Agent string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-rust-app/1.0"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gpt2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Hugging Face &lt;code&gt;tokenizers&lt;/code&gt; library provides a robust, production-ready solution for text processing in Rust applications. With support for pretrained models, authentication for private repositories, and flexible configuration options, it's an excellent choice for NLP workflows in Rust.&lt;/p&gt;

&lt;p&gt;You can find this post and more on &lt;a href="https://wheynelau.dev/posts/2025-11-21-tokenizer-rust/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>huggingface</category>
    </item>
    <item>
      <title>Setting Up Docker CI for Rust with cargo-dist</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:29:54 +0000</pubDate>
      <link>https://forem.com/wheynelau/setting-up-docker-ci-for-rust-with-cargo-dist-36nn</link>
      <guid>https://forem.com/wheynelau/setting-up-docker-ci-for-rust-with-cargo-dist-36nn</guid>
      <description>&lt;h1&gt;
  
  
  Rust CI
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Building Rust inside Docker is slow. A typical multi-stage Dockerfile compiles the binary in one stage and copies it into a minimal image in another. That works fine for local builds, but in CI it takes a long time, especially when you're emulating arm64 through QEMU.&lt;/p&gt;

&lt;p&gt;The better approach: let cargo-dist handle the compilation as part of the release workflow. By the time the Docker job runs, the binaries are already built and available as GitHub Actions artifacts. Docker just copies them in. QEMU is still needed for the final multi-arch manifest, but it's only moving files around rather than running a compiler through emulation, so arm64 builds don't take nearly as long.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The starting point is the &lt;a href="https://axodotdev.github.io/cargo-dist/book/quickstart/rust.html" rel="noopener noreferrer"&gt;cargo-dist quickstart guide&lt;/a&gt;. Once that's in place, you need a few configuration pieces to trigger the Docker build after the release.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;release.yml&lt;/code&gt;, add a &lt;code&gt;custom-docker-publish&lt;/code&gt; job that calls your docker-publish workflow and passes the plan output and binary name as inputs.&lt;/p&gt;
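
&lt;p&gt;Roughly, the job looks like this -- a sketch only, since the &lt;code&gt;needs&lt;/code&gt; list, the plan output name, and the binary name depend on your generated workflow and project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  custom-docker-publish:
    needs:
      - plan
      - host
    uses: ./.github/workflows/docker-publish.yml
    with:
      plan: ${{ needs.plan.outputs.val }}
      binary_name: my-app
    secrets: inherit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
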

&lt;p&gt;In &lt;code&gt;dist-workspace.toml&lt;/code&gt;, set &lt;code&gt;post-announce-jobs&lt;/code&gt; to point at your docker workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;post-announce-jobs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"./docker-publish"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;github-custom-job-permissions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;"docker-publish"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;contents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"read"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;allow-dirty&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ci"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The permissions block was needed because my docker workflow didn't have enough access by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Docker Workflow
&lt;/h2&gt;

&lt;p&gt;The workflow runs as a &lt;code&gt;workflow_call&lt;/code&gt; and takes the dist plan JSON, binary name, and target triple suffix as inputs. Here's the overall structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;binary_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;target_triple_suffix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown-linux-musl"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The job itself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up QEMU and Docker Buildx&lt;/li&gt;
&lt;li&gt;Log in to GHCR&lt;/li&gt;
&lt;li&gt;Extract the version from the dist plan's &lt;code&gt;announcement_tag&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Generate Docker metadata (semver tags, major.minor, major, and latest for non-prereleases)&lt;/li&gt;
&lt;li&gt;Download the amd64 and arm64 artifacts produced by cargo-dist&lt;/li&gt;
&lt;li&gt;Extract and normalize the artifacts, moving binaries into the folders the Dockerfile expects (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Build and push with &lt;code&gt;docker/build-push-action@v6&lt;/code&gt; targeting both &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
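
&lt;p&gt;Steps 5 and 6 boil down to something like this -- a sketch, since the archive names depend on how cargo-dist packages your targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Unpack each per-target archive and place the binary where the
# Dockerfile expects it (artifacts/&amp;lt;arch&amp;gt;/&amp;lt;binary&amp;gt;)
mkdir -p artifacts/amd64 artifacts/arm64
tar -xf "${BINARY_NAME}-x86_64-${SUFFIX}.tar.xz"
mv "${BINARY_NAME}-x86_64-${SUFFIX}/${BINARY_NAME}" artifacts/amd64/
tar -xf "${BINARY_NAME}-aarch64-${SUFFIX}.tar.xz"
mv "${BINARY_NAME}-aarch64-${SUFFIX}/${BINARY_NAME}" artifacts/arm64/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
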

&lt;p&gt;The version tags are pulled from the dist plan, so they stay in sync with cargo-dist's release process. The &lt;code&gt;latest&lt;/code&gt; tag is skipped for prereleases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v6&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;platforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linux/amd64,linux/arm64&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.meta.outputs.tags }}&lt;/span&gt;
    &lt;span class="na"&gt;build-args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BINARY_NAME=${{ inputs.binary_name }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;The Dockerfile depends on what your binary needs. I used distroless images and determined the right base image by running &lt;code&gt;ldd&lt;/code&gt; on the compiled binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;linux-vdso.so.1 &lt;span class="o"&gt;(&lt;/span&gt;0x00007ffdfb764000&lt;span class="o"&gt;)&lt;/span&gt;
libgcc_s.so.1 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libgcc_s.so.1
libm.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libm.so.6
libc.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libc.so.6
/lib64/ld-linux-x86-64.so.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this binary needed libc, libm, and libgcc, I went with &lt;code&gt;gcr.io/distroless/cc-debian13:nonroot&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/cc-debian13:nonroot&lt;/span&gt;

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; TARGETARCH&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; BINARY_NAME&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chmod=755 artifacts/${TARGETARCH}/${BINARY_NAME} /usr/local/bin/app&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; nonroot:nonroot&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/usr/local/bin/app"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full version with the complete workflow YAML and more context is on &lt;a href="https://wheynelau.dev/posts/2026-02-06-rust-ci/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>docker</category>
      <category>ci</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Learnings of the Poor</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 06:44:18 +0000</pubDate>
      <link>https://forem.com/wheynelau/learnings-of-the-poor-2086</link>
      <guid>https://forem.com/wheynelau/learnings-of-the-poor-2086</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Necessity is the mother of invention&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was already GPU poor, but a recent job change combined with rising component prices has also made me RAM and NVMe poor.&lt;/p&gt;

&lt;p&gt;While I am nowhere close to the optimisation experts of the '90s and early 2000s, I took this time to brush up on some fundamentals and key concepts in Python. As the saying goes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Premature optimisation is the root of all evil"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are not looking for very deep optimisations; these changes aim to follow the Pareto principle, where 80% of the outcome comes from 20% of the effort. The changes below may or may not be 20% effort, but I would consider them low-effort.&lt;/p&gt;

&lt;p&gt;As such, there won't be any discussion of performance profiling, such as finding hot loops, cache misses, or memory reallocations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iterators
&lt;/h2&gt;

&lt;p&gt;Frankly, I think this is an important concept that carries over regardless of language. Understanding iterators also helps when you need to think about channels, which are very important in Go.&lt;/p&gt;

&lt;p&gt;The typical approach collects results at every stage into lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;second_processing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;write_processed_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue: if &lt;code&gt;data.jsonl&lt;/code&gt; is bigger than your RAM, you run out of memory (OOM) very quickly. Using &lt;code&gt;yield&lt;/code&gt; instead keeps memory usage low:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_good&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function in the pipeline takes an &lt;code&gt;Iterator[dict]&lt;/code&gt; and yields records one at a time. Memory usage drops significantly.&lt;/p&gt;

&lt;p&gt;Caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files are held open throughout the pipeline, so unintentional edits or moves will break it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;json.dumps&lt;/code&gt; does not add a trailing newline, so &lt;code&gt;f.write(json.dumps(record) + '\n')&lt;/code&gt; is intentional when writing JSONL (see the writer sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
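
&lt;p&gt;To close the loop, here is a minimal writer that consumes the iterator (error handling omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def write_processed_data(records: Iterator[dict], file: str) -&amp;gt; None:
    with open(file, "w") as f:
        for record in records:
            # json.dumps does not add a newline, so append one for JSONL
            f.write(json.dumps(record) + "\n")

# Records flow through one at a time; nothing happens until
# write_processed_data starts iterating
data = read_file("data.jsonl")
data = first_filter(data)
write_processed_data(data, "output.jsonl")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
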

&lt;h3&gt;
  
  
  Learning points
&lt;/h3&gt;

&lt;p&gt;I find that iterators are a stepping stone to understanding pipelines, channels, or pub/sub patterns. When you understand iterators, you understand the bottlenecks of your code. These patterns are all, fundamentally, iterators that consume and yield.&lt;/p&gt;

&lt;p&gt;If the processing stage is slow (1 line per second) while reading and filtering are fast (4 lines per second), the whole pipeline is bounded at 1 line per second. The solution is more processing workers bridged through queues or channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read-worker-1 -&amp;gt; Filter-worker-1 -&amp;gt; Process-worker-{1..4} -&amp;gt; Write-worker-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compression
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://wheynelau.dev/posts/compression-with-ztsd/" rel="noopener noreferrer"&gt;Compression&lt;/a&gt; post, I mentioned that you should benchmark to know whether your use case benefits from compression. For write-once, read-many scenarios, higher compression levels may help.&lt;/p&gt;

&lt;p&gt;Here is a measurement for an IO-constrained scenario (reading a JSONL file from NAS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ZST: 100000it [00:05, 17220.01it/s]  (9.47 MB/s)
Raw: 100000it [00:40, 2492.39it/s]  (11.15 MB/s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because data is compressed, you can read more data per buffer. More lines are stored per MB of compressed JSONL compared to its raw form.&lt;/p&gt;

&lt;h2&gt;
  
  
  Less is more
&lt;/h2&gt;

&lt;p&gt;Less work means more efficient processing. It's about eliminating wasted work, not always adding a cache everywhere.&lt;/p&gt;

&lt;p&gt;If filtering takes 1s per line and processing takes 5s per line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process then filter on 10000 lines: &lt;code&gt;10000 * 6s = 60000s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filter then process on 10000 lines (50% bad): &lt;code&gt;10000 * 1s + 5000 * 5s = 35000s&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No complex code, no need for compiled languages. Algorithmic complexity matters too. Choosing the right data structure — a set for membership checks instead of a list, a deque instead of a list for queue operations — can eliminate entire classes of wasted work regardless of language.&lt;/p&gt;
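
&lt;p&gt;As a tiny illustration of the data structure point -- the timings are machine-dependent, so measure on your own workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

allowed_list = list(range(1_000_000))
allowed_set = set(allowed_list)

start = time.perf_counter()
_ = 999_999 in allowed_list   # O(n): scans the whole list
list_time = time.perf_counter() - start

start = time.perf_counter()
_ = 999_999 in allowed_set    # O(1): a single hash lookup
set_time = time.perf_counter() - start

print(f"list: {list_time:.6f}s, set: {set_time:.6f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
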

&lt;p&gt;The full version with code examples and benchmarks is on &lt;a href="https://wheynelau.dev/posts/2026-03-27-learnings-of-the-poor/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>optimization</category>
      <category>iterators</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 05:05:46 +0000</pubDate>
      <link>https://forem.com/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</link>
      <guid>https://forem.com/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</guid>
      <description>&lt;p&gt;When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: How many requests per second can your system handle?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency metrics&lt;/strong&gt;: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token generation speed&lt;/strong&gt;: Tokens per second under different concurrency levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tail latency&lt;/strong&gt;: P95 and P99 values that affect user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I'll walk through the key metrics for benchmarking language models and share why I built &lt;a href="https://github.com/wheynelau/llmperf-rs" rel="noopener noreferrer"&gt;llmperf-rs&lt;/a&gt;, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Existing Tools
&lt;/h2&gt;

&lt;p&gt;While working with &lt;a href="https://github.com/ray-project/llmperf" rel="noopener noreferrer"&gt;ray-project/llmperf&lt;/a&gt; (now archived), I noticed that Inter-Token Latency (ITL) was calculated by averaging per-request first, then aggregating those averages. This approach works well for many use cases, but I needed to preserve individual latency spikes during testing.&lt;/p&gt;

&lt;p&gt;There's also &lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html" rel="noopener noreferrer"&gt;genai-perf&lt;/a&gt;, which is very comprehensive. My only issue was getting it to run on Ubuntu 22.04 without Docker. As of this update, they've sunsetted &lt;code&gt;genai-perf&lt;/code&gt; in favor of &lt;a href="https://github.com/ai-dynamo/aiperf" rel="noopener noreferrer"&gt;aiperf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.vllm.ai/en/latest/benchmarking/cli/#dataset-overview" rel="noopener noreferrer"&gt;vllm-bench&lt;/a&gt; is solid too, but requires installing &lt;code&gt;vllm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The goal was to build a simple binary that runs almost anywhere with minimal dependencies. It was also a learning project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;This is a summary of the full &lt;a href="https://github.com/wheynelau/llmperf-rs/blob/master/docs/metrics.md" rel="noopener noreferrer"&gt;metrics documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time To First Token (TTFT)
&lt;/h3&gt;

&lt;p&gt;TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It's also important for RAG-based applications where a large chunk of processing happens at the prefill stage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TTFT = first_token_timestamp - request_start_timestamp&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Lower is better.&lt;/p&gt;
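
&lt;p&gt;As a minimal illustration (not llmperf-rs itself), TTFT can be measured by hand against any streaming OpenAI-compatible endpoint; the base URL and model name below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="my-model",                                    # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start               # first token arrives here
print(f"TTFT: {ttft:.3f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;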

&lt;h3&gt;
  
  
  Inter-Token Latency (ITL)
&lt;/h3&gt;

&lt;p&gt;ITL is the time between consecutive tokens during generation. Spikes can reveal multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.&lt;/p&gt;

&lt;p&gt;When testing against vLLM, I noticed that high ITL spikes happen when you benchmark close to the context limit. I suspect this is due to vLLM's eviction of requests if they exceed the KV cache size.&lt;/p&gt;

&lt;p&gt;For example, if 3 requests come in with &lt;code&gt;0.8x&lt;/code&gt; context length and &lt;code&gt;0.2x&lt;/code&gt; for generation, but the GPU has space for only &lt;code&gt;2.8x&lt;/code&gt; context length, one of the requests will be preempted.&lt;/p&gt;

&lt;p&gt;Aggregation: concatenate ALL ITL values across all responses, then compute statistics. Each response produces &lt;code&gt;(N-1)&lt;/code&gt; ITL values (where &lt;code&gt;N&lt;/code&gt; is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.&lt;/p&gt;
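
&lt;p&gt;A sketch of the difference, given per-token arrival timestamps collected from a streaming client (the variable names are assumptions, not llmperf-rs internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def itl_stats(token_times):
    # token_times[i] holds the arrival timestamps of the tokens in response i.
    # Aggregate the raw inter-token gaps across every response, then take
    # percentiles, so a single slow gap still shows up in p99 and max.
    all_itls = np.concatenate([np.diff(ts) for ts in token_times if len(ts) &gt; 1])
    return {"mean": all_itls.mean(),
            "p99": np.percentile(all_itls, 99),
            "max": all_itls.max()}

def itl_mean_of_means(token_times):
    # Per-request averaging (the original llmperf approach) smooths spikes away.
    return np.mean([np.diff(ts).mean() for ts in token_times if len(ts) &gt; 1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;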

&lt;h3&gt;
  
  
  Throughput Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prefill TPS&lt;/strong&gt; — tokens processed per second during the prefill phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Prefill TPS = input_tokens / TTFT&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, prefill TPS doesn't accurately reflect system performance because TTFT includes queue wait time, not just actual processing time. When a server is under load, your request might sit in a queue waiting for resources. The lower prefill TPS in that case reflects queue contention, not the system's processing capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode TPS&lt;/strong&gt; — tokens generated per second during the decode phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Decode TPS = output_tokens / (final_time - decode_start_time)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the generation speed: how fast the model produces output.&lt;/p&gt;
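
&lt;p&gt;Both formulas in code, using hypothetical timestamps and assuming the decode phase starts at the first token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def throughput(input_tokens, output_tokens, start, first_token_time, end):
    ttft = first_token_time - start
    prefill_tps = input_tokens / ttft                       # queue time drags this down
    decode_tps = output_tokens / (end - first_token_time)   # generation speed
    return prefill_tps, decode_tps

# Example: 1000 input tokens, 200 output tokens, 0.5s TTFT, 4.5s of decoding
print(throughput(1000, 200, 0.0, 0.5, 5.0))  # (2000.0, ~44.4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;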

&lt;h2&gt;
  
  
  What Matters Most
&lt;/h2&gt;

&lt;p&gt;For production serving, focus on &lt;strong&gt;TTFT&lt;/strong&gt;, &lt;strong&gt;ITL stats&lt;/strong&gt;, and maybe &lt;strong&gt;RPM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT&lt;/strong&gt; measures how quickly users see their first token — this is the perceived responsiveness of your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ITL statistics&lt;/strong&gt; reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose preemption events from KV cache limits and network issues between components.&lt;/p&gt;

&lt;p&gt;ITL matters less for batch jobs or non-streaming APIs where users don't watch tokens arrive in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Counting
&lt;/h2&gt;

&lt;p&gt;Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;API response&lt;/strong&gt; — Most OpenAI-compatible endpoints return token counts in the &lt;code&gt;usage&lt;/code&gt; field. By default, llmperf-rs prefers this value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer&lt;/strong&gt; — For exact input counts, pass a HuggingFace tokenizer. Note that chat templates may cause &amp;lt;10 token variance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The original llmperf uses a single tokenizer for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct one or rely on API-reported counts.&lt;/p&gt;

&lt;p&gt;For example, Llama-2 has a vocab size of 32000, while Qwen3-4B has 151936. In my own testing, setting input tokens to 8192 against a Qwen endpoint while using the default llama tokenizer returned values around 7363-7376 tokens.&lt;/p&gt;
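
&lt;p&gt;A quick way to see the mismatch for yourself, assuming the &lt;code&gt;transformers&lt;/code&gt; package; the model names are examples and the Llama-2 repository is gated on the Hub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer

text = "some benchmark prompt " * 1000   # placeholder prompt
for name in ["meta-llama/Llama-2-7b-hf", "Qwen/Qwen3-4B"]:
    tok = AutoTokenizer.from_pretrained(name)
    # Different vocabularies produce different counts for the same text.
    print(name, len(tok(text)["input_ids"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;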

&lt;h2&gt;
  
  
  Validating Your Results
&lt;/h2&gt;

&lt;p&gt;All benchmark runs should end with &lt;code&gt;finish_reason = length&lt;/code&gt; (meaning the model hit the &lt;code&gt;max_tokens&lt;/code&gt; limit). If you see &lt;code&gt;finish_reason = stop&lt;/code&gt;, the model stopped early. This skews metrics like RPM and E2E latency: a higher rate of early stops produces higher RPM and lower latency because responses are shorter.&lt;/p&gt;
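
&lt;p&gt;A simple sanity check after a run could look like this; &lt;code&gt;responses&lt;/code&gt; is assumed to be a list of parsed chat-completion results, not something llmperf-rs exposes under that name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def check_finish_reasons(responses):
    # responses: parsed chat-completion dicts collected during the benchmark run
    stopped_early = [r for r in responses
                     if r["choices"][0]["finish_reason"] != "length"]
    if stopped_early:
        print(f"{len(stopped_early)}/{len(responses)} requests stopped before "
              f"max_tokens; RPM and latency will look better than they really are")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;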

&lt;h2&gt;
  
  
  When to Use llmperf-rs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use llmperf-rs when:&lt;/strong&gt; running benchmarks with minimal dependencies, testing OpenAI-compatible endpoints, wanting low overhead (Rust, no Ray/ZMQ), or needing a quick way to test endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider alternatives when:&lt;/strong&gt; you need GPU-level metrics (use trtllm-bench or aiperf), you're testing vLLM-specific features, you require extensive reporting dashboards, or you need distributed testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ITL Matters Even When Throughput Looks Good
&lt;/h2&gt;

&lt;p&gt;High throughput with bad ITL means tokens arrive in bursts, and chat users notice the choppy streaming. ITL spikes (p99 &amp;gt;100ms) often indicate preemption, network issues, or other problems. For non-user-facing use cases like agentic coding, throughput may matter more than ITL specifics.&lt;/p&gt;

&lt;p&gt;The full version with code examples, benchmarks, and installation instructions is on &lt;a href="https://wheynelau.dev/posts/2025-12-15-benchmarking-performance/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>benchmarking</category>
      <category>rust</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
