<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Muhammad umair akram</title>
    <description>The latest articles on Forem by Muhammad umair akram (@anticrusader).</description>
    <link>https://forem.com/anticrusader</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906268%2F1c87f62c-11aa-4288-8e42-b78cc4018763.png</url>
      <title>Forem: Muhammad umair akram</title>
      <link>https://forem.com/anticrusader</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anticrusader"/>
    <language>en</language>
    <item>
      <title>Why CRNN is overkill for fixed-length CAPTCHA OCR — a 6-digit case study (100% Accuracy)</title>
      <dc:creator>Muhammad umair akram</dc:creator>
      <pubDate>Fri, 01 May 2026 17:06:24 +0000</pubDate>
      <link>https://forem.com/anticrusader/why-crnn-is-overkill-for-fixed-length-captcha-ocr-a-6-digit-case-study-100-accuracy-5g</link>
      <guid>https://forem.com/anticrusader/why-crnn-is-overkill-for-fixed-length-captcha-ocr-a-6-digit-case-study-100-accuracy-5g</guid>
      <description>&lt;p&gt;Most automation projects in regulated industries hit the same wall&lt;br&gt;
  eventually: a CAPTCHA on an internal portal blocks the very automation&lt;br&gt;
  the team is trying to build.&lt;/p&gt;

&lt;p&gt;In our case, the ops team needed to interact with one of our company's&lt;br&gt;
  internal portals dozens of times per day. The portal — built by an&lt;br&gt;
  internal team, used only by employees — gates access with a 6-digit&lt;br&gt;
  numeric CAPTCHA on every login. Reasonable security choice for the&lt;br&gt;
  original threat model. Not so reasonable for the team that needs to&lt;br&gt;
  script repetitive workflows on top of it.&lt;/p&gt;

&lt;p&gt;The right fix would have been to add a service-account API to the&lt;br&gt;
  portal team's backlog. The realistic fix, given the timeline, was&lt;br&gt;
  to teach a small ML model to read those CAPTCHAs reliably so our&lt;br&gt;
  automation script could move past the login screen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A note on legitimacy, since this article will inevitably be skimmed&lt;br&gt;
by people wondering: this is internal automation on a portal owned&lt;br&gt;
by my employer, used only inside the company, accessed with explicit&lt;br&gt;
authorization to automate. It's the same shape of work as RPA — same&lt;br&gt;
legal/ethical category. Solving CAPTCHAs on third-party websites you&lt;br&gt;
don't own is a different conversation entirely, often a TOS violation&lt;br&gt;
and sometimes illegal depending on jurisdiction. Don't conflate the&lt;br&gt;
two.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that framing out of the way, the technical question was: what's&lt;br&gt;
  the right model architecture for fixed-length numeric CAPTCHA OCR?&lt;/p&gt;

&lt;p&gt;The default answer most engineers reach for is CRNN — a CNN encoder&lt;br&gt;
  followed by an LSTM/GRU decoder, trained with CTC loss. It's the&lt;br&gt;
  standard recipe in every "deep learning for OCR" tutorial. And for&lt;br&gt;
  variable-length text recognition (handwritten notes, scanned documents,&lt;br&gt;
  scene text), CRNN is genuinely the right choice.&lt;/p&gt;

&lt;p&gt;But our CAPTCHA was always exactly 6 digits. Always 0–9. No variation&lt;br&gt;
  in length, no edge cases, no character set ambiguity. The structure&lt;br&gt;
  was completely known.&lt;/p&gt;

&lt;p&gt;When the structure of your input is known, the right architectural&lt;br&gt;
  move is to lean into that structure — not reach for the most general&lt;br&gt;
  possible model. So I skipped CRNN and built something simpler: a&lt;br&gt;
  shared CNN backbone with six independent classification heads (one&lt;br&gt;
  per digit position), tied together with learnable position embeddings.&lt;/p&gt;

&lt;p&gt;The result was 100% accuracy on our held-out test set with about&lt;br&gt;
  4,000 training samples. Here's how it's built and why each design&lt;br&gt;
  choice mattered.&lt;/p&gt;
&lt;h2&gt;
  
  
  CRNN vs. multi-head: a brief architecture comparison
&lt;/h2&gt;

&lt;p&gt;CRNN is the standard recipe for OCR: a CNN encoder pulls features from the image, a recurrent layer (LSTM or GRU) decodes those features into a sequence, and CTC loss handles the alignment between the predicted sequence and the ground-truth label without requiring per-character supervision. It's a powerful approach because it handles &lt;em&gt;variable-length&lt;/em&gt; outputs gracefully — the same model can predict a 4-character word, a 10-character phrase, or a 50-character sentence.&lt;/p&gt;

&lt;p&gt;The trade-off is complexity. CRNN has more moving parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The recurrent decoder adds parameters and training instability&lt;/li&gt;
&lt;li&gt;CTC loss has its own learning dynamics and edge cases (alignment collapse,
blank-token tuning)&lt;/li&gt;
&lt;li&gt;Inference is sequential — harder to parallelize across positions&lt;/li&gt;
&lt;li&gt;Debugging is harder — when the model outputs "13456" instead of "123456," you need
to figure out whether that's a recognition error, an alignment error, or a length
error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to a multi-head approach for fixed-length output. The input image runs through a shared CNN backbone once, producing a single feature vector. Then six independent classification heads each predict one digit (0–9) at one specific position. The training signal is straightforward: six cross-entropy losses, one per position, averaged. No sequence decoding. No alignment. No CTC.&lt;/p&gt;

&lt;p&gt;The structure is simpler in every dimension that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer parameters in the decoder&lt;/li&gt;
&lt;li&gt;Faster training convergence (more stable gradient signal per output)&lt;/li&gt;
&lt;li&gt;Faster inference (six parallel classifications, no sequential decode)&lt;/li&gt;
&lt;li&gt;Easier debugging (each head's output is independent and inspectable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only thing CRNN gives you that multi-head doesn't is variable-length support. And we explicitly didn't need that.&lt;/p&gt;

&lt;p&gt;The general principle worth taking away: if your task has known structure (fixed length, fixed character set, fixed slot count), encode that structure in your architecture instead of asking a more general model to learn it. You'll get faster training, fewer parameters, and better sample efficiency.&lt;/p&gt;
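&lt;p&gt;To make that concrete, here is a minimal sketch of the naive fixed-length multi-head setup (module and layer sizes are my own illustration, not the project's code; the next section refines it with position embeddings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn as nn

class NaiveMultiHeadCaptcha(nn.Module):
    """Shared backbone + six independent digit heads (hypothetical sketch)."""
    def __init__(self, feat_dim=512, num_positions=6, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for the real CNN backbone
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_positions)]
        )

    def forward(self, image):
        f = self.backbone(image)                     # one shared feature vector
        return [head(f) for head in self.heads]      # six (batch, 10) logit tensors

def captcha_loss(logit_list, labels):
    # labels: (batch, 6) integer digits; average one cross-entropy per position
    ce = nn.CrossEntropyLoss()
    losses = [ce(logits, labels[:, i]) for i, logits in enumerate(logit_list)]
    return sum(losses) / len(losses)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;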
&lt;h2&gt;
  
  
  Position embeddings: the design choice that made the shared backbone work
&lt;/h2&gt;

&lt;p&gt;The naive version of multi-head architecture has a subtle weakness. All six output heads consume the same feature vector from the shared backbone. If the backbone produces a feature vector &lt;code&gt;f&lt;/code&gt;, then each head simply looks at &lt;code&gt;f&lt;/code&gt; and emits a prediction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;prediction_position_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;head_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;prediction_position_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;head_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# ... etc., all six heads see the same f
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shared &lt;code&gt;f&lt;/code&gt; has to encode "what character is at position 1 AND what character is at position 2 AND ... AND what character is at position 6" — all simultaneously, in the same vector. With enough training data and capacity, the model can learn this. But it's inefficient — the backbone's representation is being asked to do six jobs at once, with no signal about which job it's currently serving.&lt;/p&gt;

&lt;p&gt;The fix is small but powerful: give the model an explicit signal about which&lt;br&gt;
  position it's predicting. A learnable position embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;position_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 6 positions, 10-dim each
&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cnn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# shared backbone output
&lt;/span&gt;      &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;position_emb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# which position?
&lt;/span&gt;      &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the backbone is asked one focused question — &lt;em&gt;what's at this specific position?&lt;/em&gt;&lt;br&gt;
   — and the position embedding provides the context. The model learns position-aware feature extraction without needing six separate backbones.&lt;/p&gt;
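&lt;p&gt;To show how that forward signature gets used end-to-end, here is a hypothetical inference helper that asks the model one position at a time and assembles the answer (names and shapes are assumptions, not the notebook's code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def read_captcha(model, image_tensor):
    # image_tensor: (1, 1, 50, 200) preprocessed CAPTCHA
    model.eval()
    digits = []
    with torch.no_grad():
        for pos in range(6):
            pos_idx = torch.tensor([pos])           # which slot are we asking about?
            logits = model(image_tensor, pos_idx)   # (1, 10) logits for this position
            digits.append(str(logits.argmax(dim=1).item()))
    return "".join(digits)                          # e.g. "482913"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;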

&lt;p&gt;The downstream effect is significant. With ~4,000 training samples, this design converged cleanly to 100% accuracy on the held-out test set. A naive multi-head architecture (without position embeddings) trained on the same dataset hits a lower accuracy ceiling, because the shared feature vector can't decompose its representation cleanly across positions.&lt;/p&gt;

&lt;p&gt;This pattern is worth internalizing: when multiple output heads share a backbone, give the backbone an explicit signal about which output it's serving. The signal can be a position embedding (as here), a class embedding (in multi-task learning), or any other discriminating context. The shared backbone learns better when it knows what it's working on.&lt;/p&gt;
&lt;h2&gt;
  
  
  The backbone: why eca_nfnet_l0 over plain CNN
&lt;/h2&gt;

&lt;p&gt;The shared CNN backbone needs to extract features from a small grayscale image (200x50) and produce a representation that the multi-head classifier can decode reliably. The default move for OCR work is a plain ResNet18 or VGG, but I went with &lt;code&gt;eca_nfnet_l0&lt;/code&gt; from the &lt;code&gt;timm&lt;/code&gt; library — a Normalizer-Free Net with Efficient Channel Attention.&lt;/p&gt;

&lt;p&gt;A few reasons for the choice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalizer-Free Networks&lt;/strong&gt; skip BatchNorm and replace it with weight
standardization + adaptive gradient clipping. The architecture trains stably even at small batch sizes, and inference is faster (no BN running statistics to track).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECA blocks&lt;/strong&gt; add channel-wise attention with a 1D convolution rather than the standard squeeze-excitation MLP. Lower parameter count, similar accuracy gain.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pretrained weights&lt;/strong&gt; are available via &lt;code&gt;timm&lt;/code&gt;. Even though ImageNet has nothing to do with grayscale CAPTCHAs, the low-level filters (edges, textures, basic shapes) transfer fine and reduce the data needed to converge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model gets configured to take 1-channel input instead of 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;backbone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eca_nfnet_l0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pretrained&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;in_chans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;global_pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;num_classes=0&lt;/code&gt; and &lt;code&gt;global_pool=''&lt;/code&gt; strip the final classification head and the global pooling layer — we want the raw feature map so we can attach our own multi-head classifier on top.&lt;/p&gt;
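&lt;p&gt;Because the pooling is stripped, the backbone returns a spatial feature map rather than a vector, so the model has to pool it before the heads and position embedding get involved. A quick sketch of one way to do that (the pooling choice here is my assumption, not necessarily what the original code does):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import timm

backbone = timm.create_model('eca_nfnet_l0', pretrained=True,
                             in_chans=1, num_classes=0, global_pool='')
feat_dim = backbone.num_features          # channel count of the final feature map

x = torch.randn(1, 1, 50, 200)            # one grayscale CAPTCHA-sized input
fmap = backbone(x)                        # (1, feat_dim, H', W') spatial map
features = fmap.mean(dim=(2, 3))          # global average pool -&gt; (1, feat_dim)
# `features` is what gets concatenated with the 10-dim position embedding
# before the per-position classifier.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;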

&lt;h2&gt;
  
  
  Augmentation: defaults are wrong for CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;The same lesson from my earlier YOLOv11 tutorial applies here: the default torchvision augmentation pipeline assumes you're training on natural images. CAPTCHAs are not natural images. The augmentations that help on ImageNet either don't help or actively hurt for CAPTCHA OCR.&lt;/p&gt;

&lt;p&gt;What I used for training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Compose&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Grayscale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_output_channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomRotation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                              &lt;span class="c1"&gt;# ±5° — anything more is unrealistic
&lt;/span&gt;      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomAffine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;degrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;translate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;  &lt;span class="c1"&gt;# small translation  
&lt;/span&gt;      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RandomPerspective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distortion_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# slight perspective drift
&lt;/span&gt;      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Normalize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,)),&lt;/span&gt;
      &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three augmentations I deliberately &lt;em&gt;did not&lt;/em&gt; use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No horizontal flips.&lt;/strong&gt; A flipped 6 looks like a 9. A flipped 7 doesn't look like any digit. Training on flips actively confuses the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vertical flips.&lt;/strong&gt; Same logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No heavy rotation.&lt;/strong&gt; CAPTCHA samples already include some rotation. Adding ±30° would generate training data that doesn't reflect actual portal output, hurting generalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three augmentations I &lt;em&gt;did&lt;/em&gt; use, tuned more deliberately than the defaults would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited rotation (±5°)&lt;/strong&gt; to mimic the small in-the-wild rotation present in real CAPTCHA samples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation augmentation&lt;/strong&gt; to handle variable horizontal position of digits
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perspective distortion (mild)&lt;/strong&gt; to handle the subtle shear/skew the CAPTCHA generator applies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Validation transforms strip all augmentation — straight grayscale + normalize + resize. The validation set should reflect actual production input, not augmented training input.&lt;/p&gt;
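&lt;p&gt;For completeness, the validation/inference pipeline described above reduces to something like this (a sketch that mirrors the training transform minus the augmentation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
    transforms.Resize((50, 200)),   # same target size as training, no augmentation
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;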

&lt;h2&gt;
  
  
  Training: averaging six losses instead of summing them
&lt;/h2&gt;

&lt;p&gt;Each output head produces a 10-class logit vector for one digit position. The loss for each head is straightforward cross-entropy. The question is how to combine the six losses into a single training signal.&lt;/p&gt;

&lt;p&gt;The naive approach is to sum them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loss1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but it changes the effective learning rate. With six loss terms summing, the gradient magnitude is roughly six times larger than a single-head model. To compensate, you'd need to divide your learning rate by ~6 to get equivalent training dynamics.&lt;/p&gt;

&lt;p&gt;The cleaner approach — what I used — is to average:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;6.0&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the gradient magnitude is comparable to a single-head model, so the standard Adam learning rate (3e-4 for fine-tuning a pretrained backbone) just works without further tuning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimizer: Adam, &lt;code&gt;lr=3e-4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Loss: averaged cross-entropy across six heads&lt;/li&gt;
&lt;li&gt;Batch size: 128&lt;/li&gt;
&lt;li&gt;Epochs: 150 max, manually stopped at epoch 74 once val-loss plateaued near zero&lt;/li&gt;
&lt;li&gt;No learning-rate schedule — for a task this narrow, on a pretrained backbone, default Adam dynamics were sufficient&lt;/li&gt;
&lt;/ul&gt;
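&lt;p&gt;Putting the pieces above together, here is a condensed sketch of one training step (variable names are illustrative; &lt;code&gt;model&lt;/code&gt; follows the &lt;code&gt;(image, position_idx)&lt;/code&gt; forward shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def train_step(images, labels):
    # images: (batch, 1, 50, 200); labels: (batch, 6) with digit indices 0-9
    optimizer.zero_grad()
    losses = []
    for pos in range(6):
        pos_idx = torch.full((images.size(0),), pos, dtype=torch.long)
        logits = model(images, pos_idx)          # (batch, 10) for this position
        losses.append(criterion(logits, labels[:, pos]))
    loss = sum(losses) / 6.0                     # averaged, not summed
    loss.backward()
    optimizer.step()
    return loss.item()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;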

&lt;p&gt;The training run produced a clean monotonic decrease in val-loss for the first ~30 epochs, then plateaued at the noise floor as the model hit 100% on the held-out set.&lt;br&gt;
   By epoch 74, val-loss was around 0.005 — effectively zero — so I stopped manually rather than running out the planned 150 epochs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Test-time augmentation: belt-and-suspenders for production
&lt;/h2&gt;

&lt;p&gt;Once the trained model hits 100% on the held-out validation set, you'd think there's nothing left to do for inference. There isn't, accuracy-wise. But production has weirder inputs than test sets — different screenshot resolutions, slight color shifts, edge alignment differences from the live portal vs. the captured training samples.&lt;/p&gt;

&lt;p&gt;For robustness against these distribution-shift cases, I added test-time augmentation (TTA): run the model on multiple lightly-modified versions of the same input image, average the predictions across versions, return the consensus.&lt;/p&gt;

&lt;p&gt;The pattern is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_with_tta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="c1"&gt;# Variant 1: original input
&lt;/span&gt;      &lt;span class="n"&gt;logits_original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

      &lt;span class="c1"&gt;# Variant 2: lightly center-cropped, then resized back
&lt;/span&gt;      &lt;span class="n"&gt;logits_cropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transform_with_center_crop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

      &lt;span class="c1"&gt;# Average the logits before argmax
&lt;/span&gt;      &lt;span class="n"&gt;averaged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits_original&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;logits_cropped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;averaged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The center-crop variant trims a bit of border and resizes back to the original input dimensions. This forces the model to see the digit content at a slightly different effective magnification, which in practice catches a small set of edge cases the un-cropped pass would miss on its own.&lt;/p&gt;
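&lt;p&gt;The &lt;code&gt;transform_with_center_crop&lt;/code&gt; helper isn't spelled out above; a plausible version looks like this (the crop margin is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from torchvision import transforms

# Trim a small border, then resize back to the model's expected input size.
transform_with_center_crop = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.CenterCrop((46, 184)),   # assumed margin: a few pixels off each edge
    transforms.Resize((50, 200)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;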

&lt;p&gt;Trade-off: roughly 2x inference time per CAPTCHA. For a portal that gets logged into a few times a minute, that's invisible (a few extra milliseconds). For a high-throughput service, you'd want to benchmark first.&lt;/p&gt;

&lt;p&gt;For a model already at 100% on validation, TTA is mostly belt-and-suspenders — catches the production-only edge cases I couldn't anticipate during training. Worth the small inference cost; not worth it for a less-mature model where you'd be better served improving training first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change next time
&lt;/h2&gt;

&lt;p&gt;Honest reflection on the things I'd revisit if I rebuilt this from scratch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixup, defined but unused.&lt;/strong&gt; I implemented a Mixup augmentation function in the training notebook but never wired it into the actual training loop. At 100% accuracy on the held-out set, Mixup probably wouldn't have helped — there's no headroom left to capture. But on a harder version of this task (more characters, more visually similar classes, less data), Mixup is one of the lowest-cost regularizers worth trying first.&lt;/p&gt;
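&lt;p&gt;For reference, a Mixup helper for this setup could be as small as the sketch below (my own illustration of the idea, not the notebook's unused function):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def mixup_batch(images, labels, alpha=0.2):
    # Blend random pairs of CAPTCHA images; keep both label sets so each per-position
    # loss can be weighted: lam * ce(logits, y_a[:, pos]) + (1 - lam) * ce(logits, y_b[:, pos])
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;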

&lt;p&gt;&lt;strong&gt;Automated early stopping rather than babysitting.&lt;/strong&gt; I stopped training manually at epoch 74 by watching the val-loss curve in the notebook. A more disciplined run would have wired &lt;code&gt;patience=10&lt;/code&gt; early stopping directly into the training loop — same outcome, less babysitting, easier to reproduce later.&lt;/p&gt;
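&lt;p&gt;A minimal version of that patience-based stop (a sketch, with &lt;code&gt;train_one_epoch()&lt;/code&gt; and &lt;code&gt;evaluate()&lt;/code&gt; standing in for the real loop):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;best_val_loss = float('inf')
bad_epochs = 0
patience = 10

for epoch in range(150):
    train_one_epoch()                   # placeholder: one pass over the training set
    val_loss = evaluate()               # placeholder: averaged loss on the val set
    if val_loss &lt; best_val_loss - 1e-4:
        best_val_loss = val_loss
        bad_epochs = 0                  # improvement: reset the counter, save a checkpoint
    else:
        bad_epochs += 1
        if bad_epochs &gt;= patience:
            print(f"stopping early at epoch {epoch}")
            break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;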

&lt;p&gt;&lt;strong&gt;The dataset size question.&lt;/strong&gt; ~4,000 samples with strong augmentation got us to 100%. I never ran the experiment of "how few samples could we get away with?" The floor is probably around 1,500–2,000 samples for this exact CAPTCHA generator. For future similar projects, I'd start there and add more data only if accuracy plateaus below the target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transformer-based encoders for harder CAPTCHAs.&lt;/strong&gt; This CAPTCHA was 6 fixed-length numeric digits. If the task scaled to alphanumeric, variable-length, or adversarially-designed CAPTCHAs (the kind built specifically to defeat ML), a transformer-based vision encoder (ViT, Swin, or a TrOCR-style decoder) would be a more expressive starting point. The multi-head + position embedding approach has a ceiling beyond which it stops being the right tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was TTA necessary?&lt;/strong&gt; Possibly not — given the model already hit 100%. The right answer would have been to compare production accuracy with and without TTA over a few weeks. I added TTA pre-emptively rather than measuring whether it was needed. That's an antipattern I'd correct.&lt;/p&gt;

&lt;p&gt;The general lesson here: at 100% accuracy on validation, the temptation is to keep adding tricks (TTA, ensembles, larger models). The actual move is to stop, measure on production, and only add complexity when production data tells you to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle to take away
&lt;/h2&gt;

&lt;p&gt;When the structure of your task is known, encode that structure in your architecture rather than reaching for the most general possible model.&lt;/p&gt;

&lt;p&gt;For fixed-length CAPTCHA OCR, "the structure is known" meant six positions, ten classes per position, no length variation. The right architectural answer was six classification heads with a shared backbone and position embeddings — not a CRNN with sequence decoding and CTC loss.&lt;/p&gt;

&lt;p&gt;This pattern applies far beyond CAPTCHAs. A few examples where it shows up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Form-field extraction&lt;/strong&gt; with a fixed schema (name, date, address, signature in known boxes). Don't use a free-form sequence model. Use field-specific heads attached to a shared document encoder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-label classification&lt;/strong&gt; with a known label vocabulary. Don't use a generative decoder. Use one head per label.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time series forecasting&lt;/strong&gt; with a known forecast horizon. The right architecture often has explicit per-horizon heads, not a single autoregressive decoder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured information extraction&lt;/strong&gt; from a well-defined schema (invoices, lab reports, government forms). Slot-filling architecture beats sequence-to-sequence when the slots are stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The general engineering instinct is to reach for the most flexible model — the one that handles the widest range of inputs. For research and exploratory work, that's right. For production work where the input structure is genuinely known and stable, it's wrong. Specificity wins on training stability, sample efficiency, inference speed, and debuggability.&lt;/p&gt;

&lt;p&gt;For this CAPTCHA project, the result was 100% accuracy on a held-out test set with about 4,000 training samples and a model that runs in under 50ms on CPU. The constraints made the problem easier; matching the architecture to the constraints made the solution simple.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>computervision</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough</title>
      <dc:creator>Muhammad umair akram</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:16:55 +0000</pubDate>
      <link>https://forem.com/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-2a1g</link>
      <guid>https://forem.com/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-2a1g</guid>
      <description>&lt;p&gt;Every day, banking ops teams manually review thousands of documents - loan applications, KYC forms, contracts - looking for the right stamps, the right signatures, in the right places. It's slow, expensive, and exactly the kind of work computer vision was made to automate.&lt;/p&gt;

&lt;p&gt;The catch is that most YOLO tutorials online teach you to detect cars, dogs, or people in natural photos. None of that translates cleanly to documents. Documents are structured, scanned at varying quality, often photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a clean PDF will collapse on a phone-shot photo of the same form.&lt;/p&gt;

&lt;p&gt;"Over the past few weeks I've been deep in shipping a YOLOv11-based detector for stamps and signatures on documents in a regulated banking environment."&lt;/p&gt;

&lt;p&gt;The work taught me where the off-the-shelf tutorials end and where the real engineering begins. Here's the playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why YOLOv11 over the alternatives
&lt;/h2&gt;

&lt;p&gt;There are a few reasonable starting points for document object detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layout-aware models like LayoutLMv3 or Donut&lt;/strong&gt; - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classical OpenCV approaches&lt;/strong&gt; - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YOLO family (v8, v11)&lt;/strong&gt; - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I went with YOLOv11. The &lt;code&gt;ultralytics&lt;/code&gt; Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80%: data preparation and annotation
&lt;/h2&gt;

&lt;p&gt;Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotation tooling.&lt;/strong&gt; I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class taxonomy.&lt;/strong&gt; Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;signature&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;(Optionally &lt;code&gt;handwritten_initials&lt;/code&gt; if your forms include them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Train/val/test split discipline.&lt;/strong&gt; Separate documents into the three splits &lt;em&gt;by source&lt;/em&gt;, not just randomly. If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augmentation strategy - and why the defaults are wrong for documents.&lt;/strong&gt; The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rotation should be tightly limited (±5°).&lt;/strong&gt; Documents are upright. Heavy rotation creates training examples that don't reflect production input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mosaic augmentation should be off.&lt;/strong&gt; Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What helps instead&lt;/strong&gt; (sketched below): brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).&lt;/li&gt;
&lt;/ul&gt;
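&lt;p&gt;If you run that kind of augmentation offline on YOLO-format labels (rather than through the ultralytics built-ins), a sketch with albumentations might look like the following; the parameter choices are illustrative assumptions, not the project's exact pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import albumentations as A

# Document-oriented augmentation sketch: values are illustrative, not tuned.
doc_aug = A.Compose(
    [
        A.Rotate(limit=5, p=0.5),                  # keep documents essentially upright
        A.RandomBrightnessContrast(p=0.7),         # different scan/exposure qualities
        A.ImageCompression(p=0.5),                 # JPEG artifacts from low-quality scans
        A.GaussianBlur(blur_limit=(3, 5), p=0.3),  # out-of-focus phone shots
        A.CoarseDropout(p=0.3),                    # crude stand-in for partial occlusion
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;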
"The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change."
## Training configuration that actually matters
Most YOLO hyperparameters are fine at defaults. The ones that move the
 needle on documents:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolo11m.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# higher imgsz matters for small stamps
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;lr0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# early stopping if mAP stalls
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;augment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# off for documents
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;degrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# limit rotation
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;fliplr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# don't horizontally flip docs
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="sb"&gt;``&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Two&lt;/span&gt; &lt;span class="n"&gt;things&lt;/span&gt; &lt;span class="n"&gt;worth&lt;/span&gt; &lt;span class="n"&gt;flagging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`imgsz=1024`&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="mf"&gt;640.&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;Stamps&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="n"&gt;resolution&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;become&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;few&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;too&lt;/span&gt; &lt;span class="n"&gt;small&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;detect&lt;/span&gt; &lt;span class="n"&gt;reliably&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Higher&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;costs&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;compute&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;but&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="n"&gt;gain&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;small&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;substantial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;Disable&lt;/span&gt; &lt;span class="n"&gt;horizontal&lt;/span&gt; &lt;span class="n"&gt;flipping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;flipped&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;wrong&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;Augmentations&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;produce&lt;/span&gt; &lt;span class="n"&gt;never&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;production&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="n"&gt;hurt&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;generalization&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;actually&lt;/span&gt; &lt;span class="n"&gt;care&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="c1"&gt;## The metric you should actually optimize for
&lt;/span&gt;&lt;span class="n"&gt;Most&lt;/span&gt; &lt;span class="n"&gt;tutorials&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`mAP@0.5`&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}.&lt;/span&gt; &lt;span class="n"&gt;For&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;regulated&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the wrong primary metric.
Ops teams care about **precision**. When the model says &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;there&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="n"&gt;here&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; they need it to be right. A false positive sends a
 document downstream that shouldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t be there, costing reviewer time. A
 false negative is recoverable - the document falls back to manual
 review, which is the existing baseline.
Track both, but if you have to optimize one, optimize precision. Your
 ops manager will thank you.
## Inference and deployment
A model that runs on a GPU is fun. A model that runs on a CPU is
 shippable. For most document-AI workloads - where you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re processing on the order of dozens to hundreds of pages per minute, not millions - 
 CPU inference with an ONNX-exported model is faster to deploy, cheaper 
 to run, and far more compatible with locked-down production environments 
 where GPU drivers are a fight you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t want.
The flow is:
1. Train with {% raw %}`ultralytics` (PyTorch backend, GPU during training)
 2. Export the trained weights to ONNX
 3. Serve via `ultralytics`&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s ONNX-runtime path on CPU at inference time
Step 2 is one line:


```python
 from ultralytics import YOLO
model = YOLO(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
 model.export(format=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) # writes best.onnx alongside best.pt
 ```


Step 3 - the inference service:


```python
 from fastapi import FastAPI, UploadFile
 from ultralytics import YOLO
 from PIL import Image
 import io
app = FastAPI()
 model = YOLO(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) # ONNX runtime, CPU-only
@app.post(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
 async def detect(file: UploadFile):
 image = Image.open(io.BytesIO(await file.read()))
 results = model(image)
detections = []
 for r in results:
 for box in r.boxes:
 detections.append({
 &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: model.names[int(box.cls)],
 &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: float(box.conf),
 &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bbox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: box.xyxy.tolist()[0],
 })
return {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: detections}
 ```


The most important line in that snippet is `model = YOLO(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)`
 at module level - load the model **once at startup**, never per request.
 Reloading the model on every request is the most common production
 mistake I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve seen on YOLO endpoints. It&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the difference between 50ms
 response time and 5,000ms.
For the container: a slim Python base image (`python:3.11-slim`) is
 enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image
 ends up under 500MB, starts in seconds, and runs anywhere - including
 locked-down corporate VMs and on-prem environments where shipping a
 GPU-dependent service is months of approvals you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have.
That&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the real tradeoff: you give up a small amount of per-request
 latency in exchange for a service that deploys today, not next quarter.

## What the tutorials don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t tell you
Three lessons the standard YOLO blog posts skip:
**1. The long tail of weird scans is where production breaks.** Faxed
 pages with horizontal banding, partially photocopied documents, phone
 shots with one corner cut off, watermarks bleeding through from the
 back side. Your training set won&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t include enough of these. Get a
 sample of real production input as fast as possible - even just 50
 images - and use them for evaluation, not training. They tell you what
 the world actually looks like.
**2. Log every prediction with the input image hash.** When the model
 fails in production, you want to be able to find the exact input that
 broke it, retroactively. Hash the input, log the prediction, store both.
 That&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s how you build round-2 training data without hunting.
**3. Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t chase mAP@0.95.** Diminishing returns. If your business
 needs 95% precision at 70% recall, optimize for that operating point - 
 not for a metric that summarizes the whole curve. Talk to your ops
 team. Get the actual numbers they care about. Train against those.
## Closing
The model is not the bottleneck for document AI. The bottleneck is
 annotation discipline, augmentation tuned to real production input,
 and deployment that doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t blow up under load. If you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re building
 computer vision for regulated industries - banking, insurance, legal,
 healthcare - the playbook above is what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s worked for me. The frameworks
 change. The data discipline doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough</title>
      <dc:creator>Muhammad umair akram</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:16:55 +0000</pubDate>
      <link>https://forem.com/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</link>
      <guid>https://forem.com/anticrusader/fine-tuning-yolov11-to-detect-stamps-and-signatures-on-banking-documents-a-practical-walkthrough-5753</guid>
      <description>&lt;p&gt;Every day, banking ops teams manually review thousands of documents - &lt;br&gt;
 loan applications, KYC forms, contracts - looking for the right stamps,&lt;br&gt;
 the right signatures, in the right places. It's slow, expensive, and&lt;br&gt;
 exactly the kind of work computer vision was made to automate.&lt;br&gt;
The catch is that most YOLO tutorials online teach you to detect cars,&lt;br&gt;
 dogs, or people in natural photos. None of that translates cleanly to&lt;br&gt;
 documents. Documents are structured, scanned at varying quality, often&lt;br&gt;
 photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a&lt;br&gt;
 clean PDF will collapse on a phone-shot photo of the same form.&lt;/p&gt;

&lt;p&gt;"Over the past few weeks I've been deep in shipping a YOLOv11-based&lt;br&gt;
 detector for stamps and signatures on documents in a regulated banking&lt;br&gt;
 environment."&lt;/p&gt;

&lt;p&gt;The work taught me where the off-the-shelf tutorials end and where the&lt;br&gt;
 real engineering begins. Here's the playbook.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why YOLOv11 over the alternatives
&lt;/h2&gt;

&lt;p&gt;There are a few reasonable starting points for document object detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layout-aware models like LayoutLMv3 or Donut&lt;/strong&gt; - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects
 (stamps, signatures, initials).
 - &lt;strong&gt;Classical OpenCV approaches&lt;/strong&gt; - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.
 - &lt;strong&gt;YOLO family (v8, v11)&lt;/strong&gt; - the sweet spot for object detection on
 documents. Fast, well-documented, easy to fine-tune, and the
 precision/recall tradeoff is tunable to ops-team requirements.
I went with YOLOv11. The &lt;strong&gt;ultralytics&lt;/strong&gt; Python package handles most of the
 busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low
 scan resolutions - better than older versions.
&lt;h2&gt;
  
  
  The 80%: data preparation and annotation
&lt;/h2&gt;

&lt;p&gt;Anyone who's shipped CV in production will tell you the same thing: the&lt;br&gt;
 model is the easy part. Data is where the time goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Annotation tooling.&lt;/strong&gt; I used Roboflow - clean web UI for bounding-box&lt;br&gt;
 labeling, automatic train/val/test splits, easy export to YOLO format.&lt;br&gt;
 CVAT is the open-source alternative if you can't use a SaaS for&lt;br&gt;
 compliance reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class taxonomy.&lt;/strong&gt; Resist the urge to define ten classes on day one.&lt;br&gt;
 Start with the smallest set that solves the business problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;signature&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stamp&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;(Optionally &lt;strong&gt;handwritten_initials&lt;/strong&gt; if your forms include them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More classes means more labeled examples per class, more failure modes,&lt;br&gt;
 and a harder model to debug. You can always split a class later. You&lt;br&gt;
 can rarely merge messy ones cleanly.&lt;/p&gt;
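&lt;p&gt;For concreteness, this is roughly what the YOLO data config for that minimal&lt;br&gt;
 taxonomy could look like. The paths are placeholders for wherever your Roboflow&lt;br&gt;
 or CVAT export lands, not the project's real directories:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Minimal ultralytics dataset config for a two-class taxonomy.
# Adjust the root path and split folders to match your export.
dataset_yaml = """\
path: /data/bank_docs
train: images/train
val: images/val
test: images/test
names:
  0: signature
  1: stamp
"""

Path('dataset.yaml').write_text(dataset_yaml)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
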
&lt;p&gt;&lt;strong&gt;Train/val/test split discipline.&lt;/strong&gt; Separate documents into the three&lt;br&gt;
 splits &lt;em&gt;by source&lt;/em&gt;, not just randomly. If the same form template appears&lt;br&gt;
 in both train and val, your validation metric is lying to you - the&lt;br&gt;
 model is learning the form layout, not the object. In a regulated&lt;br&gt;
 environment where wrong predictions cost real money, you cannot afford&lt;br&gt;
 a lying validation set.&lt;/p&gt;
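&lt;p&gt;A minimal sketch of what "split by source" can look like in practice - whole&lt;br&gt;
 form templates or originating sources are assigned to a split, never individual&lt;br&gt;
 pages. The document dict shape here is an assumption for illustration, not the&lt;br&gt;
 project's actual data model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
from collections import defaultdict

def split_by_source(documents, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Assign whole sources to train/val/test so the same form template
    never appears in more than one split."""
    by_source = defaultdict(list)
    for doc in documents:  # doc is assumed to look like {'path': ..., 'source': ...}
        by_source[doc['source']].append(doc)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n_train = int(ratios[0] * len(sources))
    n_val = int(ratios[1] * len(sources))
    groups = {
        'train': sources[:n_train],
        'val': sources[n_train:n_train + n_val],
        'test': sources[n_train + n_val:],
    }
    # Flatten back to per-document lists, split by source membership
    return {split: [d for s in srcs for d in by_source[s]]
            for split, srcs in groups.items()}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
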
&lt;p&gt;&lt;strong&gt;Augmentation strategy - and why the defaults are wrong for documents.&lt;/strong&gt;&lt;br&gt;
 The off-the-shelf YOLO augmentation defaults are designed for natural&lt;br&gt;
 images. They include rotation up to 30°, mosaic, MixUp. For documents,&lt;br&gt;
 that's actively wrong:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rotation should be tightly limited (±5°).&lt;/strong&gt; Documents are upright. Heavy rotation creates training examples that don't reflect production input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mosaic augmentation should be off.&lt;/strong&gt; Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What helps instead:&lt;/strong&gt; brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Training configuration that actually matters
&lt;/h2&gt;

&lt;p&gt;Most YOLO hyperparameters are fine at defaults. The ones that move the&lt;br&gt;
 needle on documents:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolo11m.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Higher imgsz matters for small stamps
&lt;/span&gt;    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lr0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Early stopping if mAP stalls
&lt;/span&gt;    &lt;span class="n"&gt;augment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Off for documents
&lt;/span&gt;    &lt;span class="n"&gt;degrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Limit rotation
&lt;/span&gt;    &lt;span class="n"&gt;fliplr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="c1"&gt;# Don't horizontally flip docs
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two things worth flagging:&lt;br&gt;
&lt;strong&gt;Raise the input size (imgsz=1024).&lt;/strong&gt; Stamps at low resolution can become a few&lt;br&gt;
 pixels - too small for the model to detect reliably. Higher input size&lt;br&gt;
 costs more compute per image, but the precision gain on small objects&lt;br&gt;
 is substantial.&lt;br&gt;
&lt;strong&gt;Disable horizontal flipping.&lt;/strong&gt; A flipped form is a wrong form.&lt;br&gt;
 Augmentations that produce never-seen-in-production inputs hurt&lt;br&gt;
 generalization on the inputs you actually care about.&lt;/p&gt;
&lt;h2&gt;
  
  
  The metric you should actually optimize for
&lt;/h2&gt;

&lt;p&gt;Most tutorials default to &lt;strong&gt;mAP@0.5&lt;/strong&gt;. For document AI in a regulated&lt;br&gt;
 environment, that's the wrong primary metric.&lt;br&gt;
Ops teams care about &lt;strong&gt;precision&lt;/strong&gt;. When the model says "there's a&lt;br&gt;
 signature here," they need it to be right. A false positive sends a&lt;br&gt;
 document downstream that shouldn't be there, costing reviewer time. A&lt;br&gt;
 false negative is recoverable - the document falls back to manual&lt;br&gt;
 review, which is the existing baseline.&lt;br&gt;
Track both, but if you have to optimize one, optimize precision. Your&lt;br&gt;
 ops manager will thank you.&lt;/p&gt;
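&lt;p&gt;In practice, the lever for a precision-first deployment is the confidence&lt;br&gt;
 threshold. A rough sketch of sweeping it against held-out pages to find the&lt;br&gt;
 operating point your ops team asked for - the image path is a placeholder, and a&lt;br&gt;
 real precision number still needs ground-truth labels plus an IoU-matching step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

model = YOLO('best.pt')

# Higher conf means fewer, more trustworthy boxes: recall drops, precision rises.
for conf in (0.25, 0.5, 0.7):
    results = model.predict('val_page_001.png', conf=conf, verbose=False)
    n_boxes = sum(len(r.boxes) for r in results)
    print(f'conf={conf}: {n_boxes} detections')
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
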
&lt;h2&gt;
  
  
  Inference and deployment
&lt;/h2&gt;

&lt;p&gt;A model that runs on a GPU is fun. A model that runs on a CPU is&lt;br&gt;
 shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - &lt;br&gt;
 CPU inference with an ONNX-exported model is faster to deploy, cheaper &lt;br&gt;
 to run, and far more compatible with locked-down production environments &lt;br&gt;
 where GPU drivers are a fight you don't want.&lt;br&gt;
The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train with &lt;strong&gt;ultralytics&lt;/strong&gt; (PyTorch backend, GPU during training)&lt;/li&gt;
&lt;li&gt;Export the trained weights to ONNX&lt;/li&gt;
&lt;li&gt;Serve via &lt;strong&gt;ultralytics&lt;/strong&gt;'s ONNX-runtime path on CPU at inference time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 2 is one line:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# writes best.onnx alongside best.pt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Step 3 - the inference service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;best.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# ONNX runtime, CPU-only
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bbox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xyxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in that snippet is &lt;strong&gt;model = YOLO('best.onnx')&lt;/strong&gt;&lt;br&gt;
 at module level - load the model &lt;strong&gt;once at startup&lt;/strong&gt;, never per request.&lt;br&gt;
 Reloading the model on every request is the most common production&lt;br&gt;
 mistake I've seen on YOLO endpoints. It's the difference between 50ms&lt;br&gt;
 response time and 5,000ms.&lt;br&gt;
For the container: a slim Python base image (&lt;strong&gt;python:3.11-slim&lt;/strong&gt;) is&lt;br&gt;
 enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image&lt;br&gt;
 ends up under 500MB, starts in seconds, and runs anywhere - including&lt;br&gt;
 locked-down corporate VMs and on-prem environments where shipping a&lt;br&gt;
 GPU-dependent service is months of approvals you don't have.&lt;br&gt;
That's the real tradeoff: you give up a small amount of per-request&lt;br&gt;
 latency in exchange for a service that deploys today, not next quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tutorials don't tell you
&lt;/h2&gt;

&lt;p&gt;Three lessons the standard YOLO blog posts skip:&lt;br&gt;
&lt;strong&gt;1. The long tail of weird scans is where production breaks.&lt;/strong&gt; Faxed&lt;br&gt;
 pages with horizontal banding, partially photocopied documents, phone&lt;br&gt;
 shots with one corner cut off, watermarks bleeding through from the&lt;br&gt;
 back side. Your training set won't include enough of these. Get a&lt;br&gt;
 sample of real production input as fast as possible - even just 50&lt;br&gt;
 images - and use them for evaluation, not training. They tell you what&lt;br&gt;
 the world actually looks like.&lt;br&gt;
&lt;strong&gt;2. Log every prediction with the input image hash.&lt;/strong&gt; When the model&lt;br&gt;
 fails in production, you want to be able to find the exact input that&lt;br&gt;
 broke it, retroactively. Hash the input, log the prediction, store both.&lt;br&gt;
 That's how you build round-2 training data without hunting.&lt;br&gt;
&lt;strong&gt;3. Don't chase mAP@0.95.&lt;/strong&gt; Diminishing returns. If your business&lt;br&gt;
 needs 95% precision at 70% recall, optimize for that operating point - &lt;br&gt;
 not for a metric that summarizes the whole curve. Talk to your ops&lt;br&gt;
 team. Get the actual numbers they care about. Train against those.&lt;/p&gt;
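&lt;p&gt;For the second lesson, the mechanics are small enough to show. This is a&lt;br&gt;
 hypothetical helper rather than the project's actual logging code - in production&lt;br&gt;
 you would also persist the image itself somewhere the hash can be resolved against:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import time

def log_prediction(image_bytes, detections, log_path='predictions.jsonl'):
    """Append one JSON line per request: input hash plus what the model said.
    When the model fails in production, the hash pins down the exact input,
    which becomes round-2 training data without any hunting."""
    record = {
        'ts': time.time(),
        'sha256': hashlib.sha256(image_bytes).hexdigest(),
        'detections': detections,
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
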

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The model is not the bottleneck for document AI. The bottleneck is&lt;br&gt;
 annotation discipline, augmentation tuned to real production input,&lt;br&gt;
 and deployment that doesn't blow up under load. If you're building&lt;br&gt;
 computer vision for regulated industries - banking, insurance, legal,&lt;br&gt;
 healthcare - the playbook above is what's worked for me. The frameworks&lt;br&gt;
 change. The data discipline doesn't.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
