<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: shangkyu shin</title>
    <description>The latest articles on Forem by shangkyu shin (@zeromathai).</description>
    <link>https://forem.com/zeromathai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872570%2Fc7bba9ef-1a14-44b5-a02d-f6720ab48ab8.png</url>
      <title>Forem: shangkyu shin</title>
      <link>https://forem.com/zeromathai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zeromathai"/>
    <language>en</language>
    <item>
      <title>CNN Training Isn’t Just About Models — Augmentation vs Preprocessing vs BatchNorm</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 19:09:48 +0000</pubDate>
      <link>https://forem.com/zeromathai/cnn-training-isnt-just-about-models-augmentation-vs-preprocessing-vs-batchnorm-2gd9</link>
      <guid>https://forem.com/zeromathai/cnn-training-isnt-just-about-models-augmentation-vs-preprocessing-vs-batchnorm-2gd9</guid>
      <description>&lt;p&gt;Struggling with CNN training? Learn how data augmentation, preprocessing, and batch normalization improve generalization, optimize input scaling, and stabilize deep learning models. A practical guide to what actually matters in real-world CNN pipelines.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article:&lt;br&gt;
&lt;a href="https://zeromathai.com/en/cnn-data-processing-normalization-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/cnn-data-processing-normalization-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Stop treating augmentation, preprocessing, and BatchNorm like the same tool
&lt;/h1&gt;

&lt;p&gt;A lot of CNN advice gets blurry right here.&lt;/p&gt;

&lt;p&gt;People say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use augmentation,&lt;/li&gt;
&lt;li&gt;normalize the data,&lt;/li&gt;
&lt;li&gt;add BatchNorm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All true.&lt;/p&gt;

&lt;p&gt;But these are not three versions of the same trick.&lt;/p&gt;

&lt;p&gt;They solve different problems at different stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data augmentation&lt;/strong&gt; fixes a generalization problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data preprocessing&lt;/strong&gt; fixes an input-distribution problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BatchNorm&lt;/strong&gt; fixes an internal-training-stability problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you keep that distinction clear, a lot of CNN tuning decisions get easier.&lt;/p&gt;




&lt;h1&gt;
  
  
  1. Data augmentation fixes overfitting in data space
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What goes wrong without it
&lt;/h2&gt;

&lt;p&gt;A CNN trained on narrow data learns narrow patterns.&lt;/p&gt;

&lt;p&gt;It memorizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact positions,&lt;/li&gt;
&lt;li&gt;exact orientations,&lt;/li&gt;
&lt;li&gt;exact lighting conditions,&lt;/li&gt;
&lt;li&gt;exact textures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then validation performance drops as soon as the real input shifts a little.&lt;/p&gt;

&lt;p&gt;If your model starts overfitting in just a few epochs, augmentation is usually a better first lever than adding more layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What augmentation actually does
&lt;/h2&gt;

&lt;p&gt;It creates &lt;strong&gt;new valid variations&lt;/strong&gt; of existing training examples.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flips,&lt;/li&gt;
&lt;li&gt;translations,&lt;/li&gt;
&lt;li&gt;crops,&lt;/li&gt;
&lt;li&gt;affine transforms,&lt;/li&gt;
&lt;li&gt;noise,&lt;/li&gt;
&lt;li&gt;color jitter,&lt;/li&gt;
&lt;li&gt;elastic deformation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not random distortion for the sake of distortion.&lt;/p&gt;

&lt;p&gt;It teaches the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;these appearance changes do not change the label&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is why augmentation is really about &lt;strong&gt;learning invariance&lt;/strong&gt;.&lt;/p&gt;
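&lt;p&gt;A minimal sketch in plain Python (no framework assumed; the helper name is illustrative). A horizontal flip rearranges pixels but leaves the label untouched, which is exactly the invariance being taught:&lt;/p&gt;

```python
def horizontal_flip(image):
    """Flip a 2D image (a list of rows) left to right.

    The pixels move, but the label does not change -- that is what
    makes this a valid augmentation for many natural-image tasks.
    """
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
```

&lt;p&gt;In a real pipeline you would apply such transforms randomly, and only on the training split.&lt;/p&gt;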

&lt;h2&gt;
  
  
  When augmentation becomes harmful
&lt;/h2&gt;

&lt;p&gt;This is where people get careless.&lt;/p&gt;

&lt;p&gt;Flipping a natural object image may be fine.&lt;br&gt;
Flipping a medical image may not be fine.&lt;br&gt;
Rotating a character or digit can silently change the class.&lt;/p&gt;

&lt;p&gt;So the rule is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;only apply an augmentation if the label stays valid under it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Best mental model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;augmentation = generalization in data space&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It does not clean feature scale.&lt;br&gt;
It does not stabilize hidden layers.&lt;br&gt;
It just makes the training distribution harder to overfit.&lt;/p&gt;




&lt;h1&gt;
  
  
  2. Preprocessing fixes bad scaling and input geometry
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What goes wrong without it
&lt;/h2&gt;

&lt;p&gt;Raw input is messy.&lt;/p&gt;

&lt;p&gt;Typical issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;non-zero mean,&lt;/li&gt;
&lt;li&gt;inconsistent scale,&lt;/li&gt;
&lt;li&gt;correlated features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That hurts optimization before the model even gets a chance to learn anything interesting.&lt;/p&gt;

&lt;p&gt;One feature can dominate just because its numbers are bigger.&lt;br&gt;
Updates become inefficient.&lt;br&gt;
Training becomes slower than it needs to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What preprocessing usually includes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Zero-centering
&lt;/h3&gt;

&lt;p&gt;Subtract the mean so values are centered around zero.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normalization / standardization
&lt;/h3&gt;

&lt;p&gt;Conceptually:&lt;/p&gt;

&lt;p&gt;(x - μ) / σ&lt;/p&gt;

&lt;p&gt;Now features are measured relative to their own variability instead of raw magnitude.&lt;/p&gt;
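&lt;p&gt;A minimal pure-Python sketch of that formula for a single feature (the function name is illustrative, not from any library):&lt;/p&gt;

```python
def standardize(values):
    """Standardize one feature: subtract the mean, divide by the std.

    Afterward, each value is measured in 'standard deviations from
    the mean' instead of raw magnitude.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5
    return [(v - mean) / std for v in values]

print(standardize([2.0, 4.0, 6.0, 8.0]))
```

&lt;p&gt;The output has zero mean and unit variance, so no single feature dominates just because its raw numbers are bigger.&lt;/p&gt;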

&lt;h3&gt;
  
  
  Decorrelation
&lt;/h3&gt;

&lt;p&gt;Reduce redundancy between correlated dimensions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Whitening
&lt;/h3&gt;

&lt;p&gt;The mathematically stronger version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zero mean,&lt;/li&gt;
&lt;li&gt;reduced correlation,&lt;/li&gt;
&lt;li&gt;normalized variance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What actually matters in practice
&lt;/h2&gt;

&lt;p&gt;Whitening is elegant on paper, but per-channel normalization is usually the practical default.&lt;/p&gt;

&lt;p&gt;That is the kind of trade-off people often miss.&lt;/p&gt;

&lt;p&gt;You do not always need the most theoretically complete method.&lt;br&gt;
You need the method that makes optimization cleaner without adding unnecessary complexity.&lt;/p&gt;

&lt;p&gt;So in many real CNN pipelines, this is enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mean subtraction,&lt;/li&gt;
&lt;li&gt;standardization,&lt;/li&gt;
&lt;li&gt;per-channel normalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best mental model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;preprocessing = making the raw input optimization-friendly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is not about creating diversity.&lt;br&gt;
It is not a replacement for BatchNorm.&lt;br&gt;
It solves a different problem.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. BatchNorm fixes instability inside the network
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What goes wrong without it
&lt;/h2&gt;

&lt;p&gt;Even if the input is normalized well, deeper training can still become unstable.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because each layer changes during learning.&lt;br&gt;
That means downstream layers keep seeing shifting inputs.&lt;/p&gt;

&lt;p&gt;So later layers are learning on top of moving targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  What BatchNorm actually does
&lt;/h2&gt;

&lt;p&gt;Batch normalization normalizes internal activations using mini-batch statistics.&lt;/p&gt;

&lt;p&gt;But the important part is what comes next.&lt;/p&gt;

&lt;p&gt;It does not stop at normalization.&lt;/p&gt;

&lt;p&gt;It also applies a learnable scale and shift afterward.&lt;/p&gt;

&lt;p&gt;That detail matters a lot.&lt;/p&gt;

&lt;p&gt;Without that second step, normalization could become too restrictive.&lt;br&gt;
With it, the network gets stability &lt;strong&gt;and&lt;/strong&gt; keeps expressive flexibility.&lt;/p&gt;

&lt;p&gt;So BatchNorm is better understood as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;normalization + representation recovery&lt;/p&gt;
&lt;/blockquote&gt;
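&lt;p&gt;A sketch of both steps for one feature over a mini-batch (pure Python for clarity; real implementations also track running statistics for inference):&lt;/p&gt;

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm sketch for a single feature over a mini-batch.

    Step 1: normalize using mini-batch statistics.
    Step 2: apply a learnable scale (gamma) and shift (beta) --
    the 'representation recovery' step that keeps the layer expressive.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    normalized = [(x - mean) / (var + eps) ** 0.5 for x in batch]
    return [gamma * x + beta for x in normalized]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))
```

&lt;p&gt;With gamma = 1 and beta = 0 this is plain normalization; during training the network learns gamma and beta, so it can undo the normalization wherever that helps.&lt;/p&gt;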

&lt;h2&gt;
  
  
  Why engineers like it
&lt;/h2&gt;

&lt;p&gt;Because it often gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;smoother optimization,&lt;/li&gt;
&lt;li&gt;more stable gradients,&lt;/li&gt;
&lt;li&gt;faster convergence,&lt;/li&gt;
&lt;li&gt;easier tuning,&lt;/li&gt;
&lt;li&gt;and more reliable deeper training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If training is unstable even after input normalization, the bottleneck has moved inside the network, which is exactly the problem BatchNorm targets, not the one preprocessing already solved.&lt;/p&gt;

&lt;p&gt;Best mental model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BatchNorm = stabilization in feature space&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not data space.&lt;br&gt;
Not raw input space.&lt;br&gt;
Internal feature space.&lt;/p&gt;




&lt;h1&gt;
  
  
  4. The comparison that clears everything up
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Main bottleneck&lt;/th&gt;
&lt;th&gt;Where it acts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Augmentation&lt;/td&gt;
&lt;td&gt;Overfitting&lt;/td&gt;
&lt;td&gt;Training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Preprocessing&lt;/td&gt;
&lt;td&gt;Bad scaling / poor input geometry&lt;/td&gt;
&lt;td&gt;Raw input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BatchNorm&lt;/td&gt;
&lt;td&gt;Internal instability&lt;/td&gt;
&lt;td&gt;Hidden activations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the distinction that matters.&lt;/p&gt;

&lt;p&gt;When someone says “just normalize it,” the next question should be:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;normalize what, exactly?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the answer changes the tool.&lt;/p&gt;




&lt;h1&gt;
  
  
  5. A real pipeline that actually makes sense
&lt;/h1&gt;

&lt;p&gt;A practical CNN workflow often looks like this:&lt;/p&gt;

&lt;h2&gt;
  
  
  During data loading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;resize if needed,&lt;/li&gt;
&lt;li&gt;compute or use dataset statistics,&lt;/li&gt;
&lt;li&gt;apply per-channel normalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  During training only
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;apply flips, crops, translations, or other label-safe augmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Inside the model
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;use BatchNorm where the architecture expects it,&lt;/li&gt;
&lt;li&gt;let it stabilize internal activations while training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layered view is much more useful than throwing all three techniques into one mental bucket.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. Common mistakes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  “BatchNorm replaces preprocessing”
&lt;/h2&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;BatchNorm stabilizes hidden activations during learning.&lt;br&gt;
It does not remove the need for reasonable input scaling.&lt;/p&gt;

&lt;h2&gt;
  
  
  “More augmentation is always better”
&lt;/h2&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Bad augmentation creates semantically broken samples and injects label noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Whitening must be best because it is more complete”
&lt;/h2&gt;

&lt;p&gt;Not necessarily.&lt;/p&gt;

&lt;p&gt;A more elegant preprocessing method is not always a better engineering choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  “These are all just regularization tricks”
&lt;/h2&gt;

&lt;p&gt;Only partly.&lt;/p&gt;

&lt;p&gt;Augmentation is much more directly tied to generalization.&lt;br&gt;
Preprocessing and BatchNorm are much more directly tied to optimization and stability.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final takeaway
&lt;/h1&gt;

&lt;p&gt;If you want one clean summary, use this:&lt;/p&gt;

&lt;p&gt;CNN training is a distribution-control problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Augmentation&lt;/strong&gt; controls variation in the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing&lt;/strong&gt; controls scale and structure in the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BatchNorm&lt;/strong&gt; controls instability in internal representations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you separate those three bottlenecks, CNN training gets much less mysterious.&lt;/p&gt;

&lt;p&gt;And your debugging gets much faster too.&lt;/p&gt;




&lt;p&gt;Which one has given you the biggest gain in practice:&lt;br&gt;
augmentation, preprocessing, or BatchNorm?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>cnn</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Evolution of Deep CNNs — From AlexNet to ResNet (Trade-offs Behind Modern Deep Learning)</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 19:09:22 +0000</pubDate>
      <link>https://forem.com/zeromathai/evolution-of-deep-cnns-from-alexnet-to-resnet-trade-offs-behind-modern-deep-learning-16db</link>
      <guid>https://forem.com/zeromathai/evolution-of-deep-cnns-from-alexnet-to-resnet-trade-offs-behind-modern-deep-learning-16db</guid>
      <description>&lt;p&gt;Deep CNN evolution is not about deeper models — it’s about resolving engineering trade-offs under constraints.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article:&lt;br&gt;
&lt;a href="https://zeromathai.com/en/deep-cnn-evolution-alexnet-resnet-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/deep-cnn-evolution-alexnet-resnet-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  CNN Evolution = Constraint Evolution
&lt;/h1&gt;

&lt;p&gt;Every CNN generation answers a different question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AlexNet → can it work?&lt;/li&gt;
&lt;li&gt;ZFNet → why does it work?&lt;/li&gt;
&lt;li&gt;VGG → does depth help?&lt;/li&gt;
&lt;li&gt;GoogLeNet → can we reduce compute?&lt;/li&gt;
&lt;li&gt;ResNet → can we optimize deeper networks?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  1. AlexNet — Feasibility
&lt;/h1&gt;

&lt;p&gt;Solved:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;deep CNNs can actually work at scale&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key ingredients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU training&lt;/li&gt;
&lt;li&gt;ReLU&lt;/li&gt;
&lt;li&gt;Dropout&lt;/li&gt;
&lt;li&gt;data augmentation&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. ZFNet — Interpretability
&lt;/h1&gt;

&lt;p&gt;Solved:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;understanding internal representations matters&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Method:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feature visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;debugging models improves architecture design&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  3. VGG vs GoogLeNet — Real Trade-off
&lt;/h1&gt;

&lt;p&gt;This is the key architectural tension.&lt;/p&gt;




&lt;h2&gt;
  
  
  VGG
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;simple architecture&lt;/li&gt;
&lt;li&gt;stacked 3×3 conv&lt;/li&gt;
&lt;li&gt;very deep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compute cost explodes&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  GoogLeNet
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Inception module&lt;/li&gt;
&lt;li&gt;multi-scale processing&lt;/li&gt;
&lt;li&gt;1×1 conv compression&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;more complex design&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Trade-off
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VGG&lt;/th&gt;
&lt;th&gt;GoogLeNet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;simplicity&lt;/td&gt;
&lt;td&gt;efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;heavy compute&lt;/td&gt;
&lt;td&gt;optimized compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;depth scaling&lt;/td&gt;
&lt;td&gt;architectural branching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Insight
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;CNN progress is trade-off engineering, not scaling&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  4. ResNet — Optimization Fix
&lt;/h1&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;deeper networks degrade performance&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;residual learning: skip connections let each block learn a residual F(x) and output F(x) + x&lt;/p&gt;

&lt;p&gt;Why it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gradient flow improves&lt;/li&gt;
&lt;li&gt;identity mapping preserved&lt;/li&gt;
&lt;li&gt;optimization becomes easier&lt;/li&gt;
&lt;/ul&gt;
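&lt;p&gt;The core idea fits in a few lines (a sketch, not the actual ResNet implementation; the hypothetical &lt;code&gt;zero_layer&lt;/code&gt; stands in for a block that has learned nothing):&lt;/p&gt;

```python
def residual_block(x, layer):
    """Residual connection: the block computes F(x) and outputs F(x) + x.

    If the layer learns nothing useful (F(x) = 0), the block still
    passes x through unchanged -- identity mapping is preserved,
    which is what makes very deep stacks optimizable.
    """
    return [f + xi for f, xi in zip(layer(x), x)]

# Hypothetical layer that has learned nothing: output is all zeros.
zero_layer = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_layer))  # [1.0, 2.0, 3.0]
```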




&lt;h1&gt;
  
  
  5. Big Picture
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Problem solved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AlexNet&lt;/td&gt;
&lt;td&gt;feasibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFNet&lt;/td&gt;
&lt;td&gt;interpretability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VGG&lt;/td&gt;
&lt;td&gt;depth scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GoogLeNet&lt;/td&gt;
&lt;td&gt;efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ResNet&lt;/td&gt;
&lt;td&gt;optimization stability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Key Pattern
&lt;/h1&gt;

&lt;p&gt;Every model follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;limitation appears&lt;/li&gt;
&lt;li&gt;root cause identified&lt;/li&gt;
&lt;li&gt;architecture changes&lt;/li&gt;
&lt;li&gt;scaling resumes&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  Final Insight
&lt;/h1&gt;

&lt;p&gt;Deep learning is not model evolution.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;continuous engineering under constraints&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Discussion:&lt;/p&gt;

&lt;p&gt;Which constraint mattered most in practice?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;depth&lt;/li&gt;
&lt;li&gt;efficiency&lt;/li&gt;
&lt;li&gt;optimization&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
      <category>cnn</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>CNN Layer Composition — A Practical Developer Guide to Activation, Pooling, and Fully Connected Layers</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:59:34 +0000</pubDate>
      <link>https://forem.com/zeromathai/cnn-layer-composition-a-practical-developer-guide-to-activation-pooling-and-fully-connected-288b</link>
      <guid>https://forem.com/zeromathai/cnn-layer-composition-a-practical-developer-guide-to-activation-pooling-and-fully-connected-288b</guid>
      <description>&lt;p&gt;CNNs are not just convolution stacks. This guide explains how activation, pooling, and fully connected layers work together to transform feature maps into predictions.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/cnn-layer-composition-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/cnn-layer-composition-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  CNN Layer Composition (Think Like an Engineer)
&lt;/h1&gt;

&lt;p&gt;A CNN is not magic.&lt;/p&gt;

&lt;p&gt;It’s a pipeline:&lt;/p&gt;

&lt;p&gt;input → feature extraction → filtering → compression → classification&lt;/p&gt;




&lt;h1&gt;
  
  
  1. Convolution Alone = Not Enough
&lt;/h1&gt;

&lt;p&gt;Convolution is linear.&lt;/p&gt;

&lt;p&gt;Stack linear layers:&lt;/p&gt;

&lt;p&gt;→ still linear&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no complex decision boundary
&lt;/li&gt;
&lt;li&gt;no deep feature learning
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Activation is mandatory.&lt;/p&gt;




&lt;h1&gt;
  
  
  2. ReLU — The Switch That Enables Depth
&lt;/h1&gt;

&lt;p&gt;ReLU:&lt;/p&gt;

&lt;p&gt;f(x) = max(0, x)&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;[-3, -1, 0.5, 2] → [0, 0, 0.5, 2]&lt;/p&gt;

&lt;p&gt;Why it matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;introduces nonlinearity
&lt;/li&gt;
&lt;li&gt;mitigates vanishing gradients
&lt;/li&gt;
&lt;li&gt;filters weak signals
&lt;/li&gt;
&lt;/ul&gt;
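&lt;p&gt;The example above in runnable form (a trivial sketch, element-wise over a list):&lt;/p&gt;

```python
def relu(values):
    """f(x) = max(0, x) applied elementwise: negatives are zeroed,
    positives pass through unchanged."""
    return [max(0, x) for x in values]

print(relu([-3, -1, 0.5, 2]))  # [0, 0, 0.5, 2]
```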




&lt;h1&gt;
  
  
  3. Shape Flow (Real Example)
&lt;/h1&gt;

&lt;p&gt;Input:&lt;br&gt;
(224, 224, 3)&lt;/p&gt;

&lt;p&gt;Conv:&lt;br&gt;
(224, 224, 64)&lt;/p&gt;

&lt;p&gt;ReLU:&lt;br&gt;
(224, 224, 64)&lt;/p&gt;

&lt;p&gt;Pooling:&lt;br&gt;
(112, 112, 64)&lt;/p&gt;

&lt;p&gt;Key rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spatial ↓
&lt;/li&gt;
&lt;li&gt;channels same
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4. Why Channels Increase
&lt;/h1&gt;

&lt;p&gt;As depth increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spatial size ↓
&lt;/li&gt;
&lt;li&gt;channel count ↑
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;→ model learns more feature types&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Pooling vs Stride
&lt;/h1&gt;

&lt;p&gt;Pooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed
&lt;/li&gt;
&lt;li&gt;no parameters
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strided Conv:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learnable
&lt;/li&gt;
&lt;li&gt;more flexible
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern models often prefer strided conv.&lt;/p&gt;




&lt;h1&gt;
  
  
  6. Max Pooling = Feature Selection
&lt;/h1&gt;

&lt;p&gt;2×2 max pooling:&lt;/p&gt;

&lt;p&gt;Input:&lt;br&gt;
1 1 2 4&lt;br&gt;&lt;br&gt;
5 6 7 8&lt;br&gt;&lt;br&gt;
3 2 1 0&lt;br&gt;&lt;br&gt;
1 2 3 4  &lt;/p&gt;

&lt;p&gt;Output:&lt;br&gt;
6 8&lt;br&gt;&lt;br&gt;
3 4  &lt;/p&gt;

&lt;p&gt;Effect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strongest signal survives
&lt;/li&gt;
&lt;li&gt;noise removed
&lt;/li&gt;
&lt;/ul&gt;
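&lt;p&gt;The same example as a pure-Python sketch (assumes even height and width; real layers handle padding and stride options):&lt;/p&gt;

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2: keep only the strongest value
    in each non-overlapping 2x2 window."""
    return [
        [max(grid[i][j], grid[i][j + 1],
             grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

grid = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool_2x2(grid))  # [[6, 8], [3, 4]]
```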




&lt;h1&gt;
  
  
  7. Receptive Field
&lt;/h1&gt;

&lt;p&gt;Deeper layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;see more context
&lt;/li&gt;
&lt;li&gt;capture higher-level features
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;p&gt;edges → textures → shapes → objects&lt;/p&gt;




&lt;h1&gt;
  
  
  8. Flatten + Dense
&lt;/h1&gt;

&lt;p&gt;Before classification:&lt;/p&gt;

&lt;p&gt;(7, 7, 512) → (25088)&lt;/p&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;p&gt;Dense → Softmax → prediction&lt;/p&gt;




&lt;h1&gt;
  
  
  9. Modern Trick: Global Average Pooling
&lt;/h1&gt;

&lt;p&gt;Instead of big dense layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;average each channel
&lt;/li&gt;
&lt;li&gt;fewer parameters
&lt;/li&gt;
&lt;li&gt;better generalization
&lt;/li&gt;
&lt;/ul&gt;
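&lt;p&gt;A sketch of the idea (pure Python; each channel is a 2D list, and the whole map collapses to one scalar per channel with zero parameters):&lt;/p&gt;

```python
def global_average_pool(feature_maps):
    """Collapse each channel's HxW map to its mean: one number per
    channel, no learnable parameters."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]

# Two 2x2 channels become two scalars.
print(global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]]))  # [2.5, 2.0]
```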




&lt;h1&gt;
  
  
  10. Full Pipeline
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Conv → detect
&lt;/li&gt;
&lt;li&gt;ReLU → filter
&lt;/li&gt;
&lt;li&gt;Pool → compress
&lt;/li&gt;
&lt;li&gt;Repeat → hierarchy
&lt;/li&gt;
&lt;li&gt;Dense → predict
&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  Debug Mindset
&lt;/h1&gt;

&lt;p&gt;If model fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bad features → conv problem
&lt;/li&gt;
&lt;li&gt;weak signal → activation issue
&lt;/li&gt;
&lt;li&gt;too slow → pooling issue
&lt;/li&gt;
&lt;li&gt;wrong output → classifier issue
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Key Takeaways
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;CNN = structured system
&lt;/li&gt;
&lt;li&gt;ReLU enables learning
&lt;/li&gt;
&lt;li&gt;Pooling controls scale
&lt;/li&gt;
&lt;li&gt;Dense layers make decisions
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Discussion
&lt;/h1&gt;

&lt;p&gt;In real projects, what matters most?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture design?&lt;/li&gt;
&lt;li&gt;training tricks?&lt;/li&gt;
&lt;li&gt;or data quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious to hear your experience.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>cnn</category>
      <category>ai</category>
    </item>
    <item>
      <title>CNN Spatial Behavior Explained: Convolution, Stride, Padding, and Output Size (With Intuition)</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:57:12 +0000</pubDate>
      <link>https://forem.com/zeromathai/spatial-behavior-of-convolution-in-cnns-stride-padding-and-feature-maps-explained-7i2</link>
      <guid>https://forem.com/zeromathai/spatial-behavior-of-convolution-in-cnns-stride-padding-and-feature-maps-explained-7i2</guid>
      <description>&lt;p&gt;Understanding CNNs requires more than just architectures. Learn how convolution, stride, padding, and output size shape spatial behavior in deep learning models, with practical intuition and real-world design insights.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/pooling-activation-layers-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/pooling-activation-layers-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  The Real Problem: Spatial Understanding (Not Layers)
&lt;/h1&gt;

&lt;p&gt;Most CNN issues are NOT about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Which architecture should I use?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wrong tensor shapes
&lt;/li&gt;
&lt;li&gt;misunderstanding stride/padding
&lt;/li&gt;
&lt;li&gt;losing spatial information too early
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever hit something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: size mismatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This post is for you.&lt;/p&gt;




&lt;h1&gt;
  
  
  Convolution = Sliding Pattern Detector
&lt;/h1&gt;

&lt;p&gt;At each position:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a small patch&lt;/li&gt;
&lt;li&gt;Multiply with filter weights&lt;/li&gt;
&lt;li&gt;Sum → one output value&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat → feature map&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local connectivity&lt;/li&gt;
&lt;li&gt;shared weights&lt;/li&gt;
&lt;li&gt;translation equivariance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why CNNs scale.&lt;/p&gt;




&lt;h1&gt;
  
  
  Filters: What Your Model Actually Learns
&lt;/h1&gt;

&lt;p&gt;Each filter learns ONE pattern.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;edge detector&lt;/li&gt;
&lt;li&gt;texture detector&lt;/li&gt;
&lt;li&gt;color transition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiple filters → multiple feature maps&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output_channels = num_filters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;CNNs don’t learn “images” — they learn patterns.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Receptive Field (Core Concept)
&lt;/h1&gt;

&lt;p&gt;Each neuron sees only part of the image.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: 32×32&lt;/li&gt;
&lt;li&gt;Kernel: 5×5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ neuron sees 5×5 region&lt;/p&gt;

&lt;p&gt;Stack layers:&lt;br&gt;
→ receptive field grows&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;early layers → local features&lt;/li&gt;
&lt;li&gt;deeper layers → global features&lt;/li&gt;
&lt;/ul&gt;
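&lt;p&gt;The growth can be computed with the standard receptive-field recursion (a sketch; the function name is illustrative): each layer adds (k − 1) times the product of the strides before it.&lt;/p&gt;

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of the last layer in a stack of convolutions.

    r grows by (k - 1) * (product of earlier strides) at each layer.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([5]))        # 5 -- the single 5x5 kernel above
print(receptive_field([3, 3]))     # 5 -- two stacked 3x3 convs see 5x5
print(receptive_field([3, 3, 3]))  # 7
```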




&lt;h1&gt;
  
  
  Stride = Resolution Control
&lt;/h1&gt;

&lt;p&gt;Stride defines how far the filter moves.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stride&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;high detail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;downsample&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Trade-off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;larger stride → faster&lt;/li&gt;
&lt;li&gt;but → information loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;using stride=2 too early → model misses fine features&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Padding = Boundary Control
&lt;/h1&gt;

&lt;p&gt;Without padding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;output shrinks&lt;/li&gt;
&lt;li&gt;edge information disappears fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With padding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spatial size preserved&lt;/li&gt;
&lt;li&gt;borders are kept&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;padding = (kernel_size - 1) // 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rule of thumb:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;deep CNNs almost always use padding&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Output Size Formula (You MUST Know This)
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Output = (m - k) / s + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;m = input size&lt;/li&gt;
&lt;li&gt;k = kernel size&lt;/li&gt;
&lt;li&gt;s = stride&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(7 - 3) / 1 + 1 = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you don’t calculate this:&lt;br&gt;
→ your model WILL break&lt;/p&gt;
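&lt;p&gt;A small helper makes the check automatic (a sketch; the 2p padding term extends the formula above, and it raises on uneven division instead of silently flooring):&lt;/p&gt;

```python
def conv_output_size(m, k, s=1, p=0):
    """Output = (m - k + 2p) / s + 1; with p = 0 this is the formula
    above. Raising on uneven division surfaces the shape mismatch
    before the framework does.
    """
    span = m - k + 2 * p
    if span % s != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // s + 1

print(conv_output_size(7, 3))              # 5, matching the example above
print(conv_output_size(224, 3, s=1, p=1))  # 224, i.e. 'same' padding
```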




&lt;h1&gt;
  
  
  One Filter vs Many Filters
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Filters&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1 feature map&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;32 channels&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Output shape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H × W × C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C = number of filters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More filters = richer representation&lt;/p&gt;




&lt;h1&gt;
  
  
  Common Real-World Mistakes
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Shape mismatch
&lt;/h3&gt;

&lt;p&gt;You didn’t compute output size correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Too much downsampling
&lt;/h3&gt;

&lt;p&gt;Large stride early → lost spatial information.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. No padding
&lt;/h3&gt;

&lt;p&gt;Edges vanish layer by layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Too few filters
&lt;/h3&gt;

&lt;p&gt;Model lacks expressive power.&lt;/p&gt;




&lt;h1&gt;
  
  
  Design Intuition (What Actually Matters)
&lt;/h1&gt;

&lt;p&gt;When designing CNNs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kernel size → what patterns you detect
&lt;/li&gt;
&lt;li&gt;stride → how fast you compress
&lt;/li&gt;
&lt;li&gt;padding → whether you preserve structure
&lt;/li&gt;
&lt;li&gt;filters → how rich your representation is
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not hyperparameter tuning.&lt;/p&gt;

&lt;p&gt;This is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;designing how your model perceives the world&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Final Takeaway
&lt;/h1&gt;

&lt;p&gt;CNNs don’t “see images”.&lt;/p&gt;

&lt;p&gt;They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan locally
&lt;/li&gt;
&lt;li&gt;extract patterns
&lt;/li&gt;
&lt;li&gt;build hierarchical representations
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;convolution&lt;/li&gt;
&lt;li&gt;receptive field&lt;/li&gt;
&lt;li&gt;stride&lt;/li&gt;
&lt;li&gt;padding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you understand:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how CNNs actually work&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;What part of CNN design still feels confusing?&lt;/p&gt;

&lt;p&gt;Drop your thoughts 👇&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Why CNNs Work: Convolution, Feature Hierarchies, and the Real Difference from Fully Connected Networks</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:46:06 +0000</pubDate>
      <link>https://forem.com/zeromathai/why-cnns-work-convolution-feature-hierarchies-and-the-real-difference-from-fully-connected-4f00</link>
      <guid>https://forem.com/zeromathai/why-cnns-work-convolution-feature-hierarchies-and-the-real-difference-from-fully-connected-4f00</guid>
      <description>&lt;p&gt;Understanding CNNs is not about memorizing layers.&lt;/p&gt;

&lt;p&gt;It’s about understanding why this design exists.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/convolutional-layer-lec-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/convolutional-layer-lec-en/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;Images are structured data.&lt;/p&gt;

&lt;p&gt;A fully connected network treats them as flat vectors.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;224×224×3 → 150,528 inputs&lt;br&gt;&lt;br&gt;
One 1,000-neuron dense layer → roughly 150 million weights  &lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No spatial awareness
&lt;/li&gt;
&lt;li&gt;Too many parameters
&lt;/li&gt;
&lt;li&gt;Overfitting
&lt;/li&gt;
&lt;/ul&gt;
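&lt;p&gt;The cost is easy to verify with plain arithmetic (an illustrative sketch; the 1,000-neuron hidden layer is an assumed example size):&lt;/p&gt;

```python
# Parameter cost of feeding a flattened image into a dense layer.
inputs = 224 * 224 * 3        # 150,528 raw pixel values
hidden = 1000                 # an assumed hidden-layer width
weights = inputs * hidden     # one weight per input-neuron pair
print(inputs, weights)        # 150528 150528000
```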




&lt;h2&gt;
  
  
  What CNNs Fix
&lt;/h2&gt;

&lt;p&gt;CNN introduces two key ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local connectivity
&lt;/li&gt;
&lt;li&gt;Weight sharing
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of connecting everything:&lt;br&gt;
→ look locally, reuse globally  &lt;/p&gt;




&lt;h2&gt;
  
  
  CNN Pipeline
&lt;/h2&gt;

&lt;p&gt;Image → Conv → ReLU → Pool → Conv → ... → FC → Softmax&lt;/p&gt;




&lt;h2&gt;
  
  
  Convolution Layer
&lt;/h2&gt;

&lt;p&gt;A filter slides across the image.&lt;/p&gt;

&lt;p&gt;At each position:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiply
&lt;/li&gt;
&lt;li&gt;Sum
&lt;/li&gt;
&lt;li&gt;Output activation
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Shape Example
&lt;/h3&gt;

&lt;p&gt;Input: 32×32×3&lt;br&gt;&lt;br&gt;
Filter: 5×5×3&lt;br&gt;&lt;br&gt;
Output: 28×28  &lt;/p&gt;
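&lt;p&gt;The 28 comes from the valid-convolution formula, output = input - kernel + 1 (assuming stride 1 and no padding); a quick check:&lt;/p&gt;

```python
def valid_conv_size(input_size, kernel_size):
    # Valid convolution, stride 1, no padding: output = input - kernel + 1
    return input_size - kernel_size + 1

print(valid_conv_size(32, 5))  # 28, matching the shape example above
```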




&lt;h3&gt;
  
  
  Why It Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Detects local patterns
&lt;/li&gt;
&lt;li&gt;Works anywhere
&lt;/li&gt;
&lt;li&gt;Learns reusable features
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Feature Maps
&lt;/h2&gt;

&lt;p&gt;Feature maps are representations.&lt;/p&gt;

&lt;p&gt;They answer:&lt;/p&gt;

&lt;p&gt;→ where is this feature?&lt;/p&gt;




&lt;h2&gt;
  
  
  ReLU (Critical)
&lt;/h2&gt;

&lt;p&gt;f(x) = max(0, x)&lt;/p&gt;

&lt;p&gt;Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model is linear
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nonlinear learning
&lt;/li&gt;
&lt;li&gt;Better optimization
&lt;/li&gt;
&lt;/ul&gt;
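&lt;p&gt;The “model is linear” point can be checked directly: two stacked linear maps collapse into a single linear map unless a nonlinearity sits between them (a toy 1-D sketch; all numbers are made up):&lt;/p&gt;

```python
def linear(w, b):
    def f(x):
        return w * x + b  # a 1-D affine map
    return f

f1 = linear(2.0, 1.0)
f2 = linear(3.0, -1.0)

def stacked(x):
    return f2(f1(x))          # two layers, no activation in between

collapsed = linear(6.0, 2.0)  # 3*(2x + 1) - 1 = 6x + 2: still one linear map

def relu(x):
    return max(0.0, x)

def nonlinear(x):
    return f2(relu(f1(x)))    # ReLU breaks the collapse

print(stacked(5.0), collapsed(5.0))  # 32.0 32.0
```

&lt;p&gt;However many linear layers you stack, the result stays a single linear map; the activation is what buys expressive power.&lt;/p&gt;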




&lt;h2&gt;
  
  
  Pooling Layer
&lt;/h2&gt;

&lt;p&gt;2×2 pooling with stride 2: 28×28 → 14×14  &lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster
&lt;/li&gt;
&lt;li&gt;More robust
&lt;/li&gt;
&lt;li&gt;Translation invariant (approx)
&lt;/li&gt;
&lt;/ul&gt;
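&lt;p&gt;That downsampling step in a minimal pure-Python sketch (real code would use a framework pooling layer; the feature-map values are made up):&lt;/p&gt;

```python
def max_pool_2x2(grid):
    # 2x2 max pooling with stride 2 on a list-of-lists feature map.
    h, w = len(grid), len(grid[0])
    return [[max(grid[i][j], grid[i][j + 1],
                 grid[i + 1][j], grid[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```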




&lt;h3&gt;
  
  
  Important Insight
&lt;/h3&gt;

&lt;p&gt;CNNs are not truly translation invariant.&lt;br&gt;&lt;br&gt;
Pooling only makes them more robust to shifts.&lt;/p&gt;

&lt;p&gt;Too much pooling:&lt;br&gt;
→ destroys spatial detail  &lt;/p&gt;

&lt;p&gt;Modern CNNs:&lt;br&gt;
→ reduce pooling&lt;br&gt;&lt;br&gt;
→ use strided convolution  &lt;/p&gt;




&lt;h2&gt;
  
  
  Fully Connected Layer
&lt;/h2&gt;

&lt;p&gt;Flatten → combine features → classify  &lt;/p&gt;

&lt;p&gt;Softmax → probabilities  &lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Hierarchy (Core Idea)
&lt;/h2&gt;

&lt;p&gt;CNNs learn progressively:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Learns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Early&lt;/td&gt;
&lt;td&gt;edges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Middle&lt;/td&gt;
&lt;td&gt;textures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep&lt;/td&gt;
&lt;td&gt;objects&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Example:&lt;br&gt;
edge → eye → face  &lt;/p&gt;




&lt;h2&gt;
  
  
  Why CNNs Beat Dense Networks
&lt;/h2&gt;

&lt;p&gt;CNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient
&lt;/li&gt;
&lt;li&gt;Spatially aware
&lt;/li&gt;
&lt;li&gt;Generalizes well
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Huge parameter count
&lt;/li&gt;
&lt;li&gt;No structure awareness
&lt;/li&gt;
&lt;li&gt;Overfits
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Debugging CNNs (Underrated Skill)
&lt;/h2&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Activation maps
&lt;/li&gt;
&lt;li&gt;Saliency maps
&lt;/li&gt;
&lt;li&gt;Grad-CAM
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debug errors
&lt;/li&gt;
&lt;li&gt;Understand predictions
&lt;/li&gt;
&lt;li&gt;Improve models
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t overuse pooling
&lt;/li&gt;
&lt;li&gt;Track feature map sizes
&lt;/li&gt;
&lt;li&gt;Prefer depth over width
&lt;/li&gt;
&lt;li&gt;Visualize early
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Insight
&lt;/h2&gt;

&lt;p&gt;The real breakthrough of CNNs is not just convolution.&lt;/p&gt;

&lt;p&gt;It is the combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locality
&lt;/li&gt;
&lt;li&gt;Parameter sharing
&lt;/li&gt;
&lt;li&gt;Hierarchical learning
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what turns pixels into meaning.&lt;/p&gt;




&lt;p&gt;For image tasks today, do you still start with CNNs, or jump straight to Vision Transformers?&lt;/p&gt;

&lt;p&gt;Let’s discuss 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Why CNNs Work for Images: The Real Design Logic Behind Convolutional Neural Networks</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:40:32 +0000</pubDate>
      <link>https://forem.com/zeromathai/why-cnns-work-for-images-the-real-design-logic-behind-convolutional-neural-networks-1j30</link>
      <guid>https://forem.com/zeromathai/why-cnns-work-for-images-the-real-design-logic-behind-convolutional-neural-networks-1j30</guid>
      <description>&lt;p&gt;Why do CNNs outperform fully connected neural networks on image tasks? This article explains local connectivity, weight sharing, pooling, and inductive bias in a practical, developer-friendly way.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/introduction-to-cnns-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/introduction-to-cnns-en/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why CNNs Needed to Exist
&lt;/h1&gt;

&lt;p&gt;CNNs were not invented just because researchers wanted a better benchmark score.&lt;/p&gt;

&lt;p&gt;They were invented because applying a standard Multilayer Perceptron to images is a bad fit.&lt;/p&gt;

&lt;p&gt;A fully connected network treats an image as a long flat vector. That already hints at the problem: images are not flat in any meaningful visual sense.&lt;/p&gt;

&lt;p&gt;They are spatial.&lt;/p&gt;

&lt;p&gt;They have local patterns.&lt;/p&gt;

&lt;p&gt;And the same useful feature can appear in different positions.&lt;/p&gt;

&lt;p&gt;That mismatch is the whole reason CNNs matter.&lt;/p&gt;

&lt;h1&gt;
  
  
  The MLP Problem in One Example
&lt;/h1&gt;

&lt;p&gt;Take a 200 × 200 RGB image.&lt;/p&gt;

&lt;p&gt;That gives:&lt;/p&gt;

&lt;p&gt;200 × 200 × 3 = 120,000 input values&lt;/p&gt;

&lt;p&gt;Now connect that input to a hidden layer with 1,000 neurons.&lt;/p&gt;

&lt;p&gt;You get about 120 million weights.&lt;/p&gt;

&lt;p&gt;That is bad for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training cost explodes,&lt;/li&gt;
&lt;li&gt;overfitting risk goes up,&lt;/li&gt;
&lt;li&gt;and the model still has no built-in understanding of spatial structure.&lt;/li&gt;
&lt;/ul&gt;
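&lt;p&gt;The arithmetic is worth writing out, together with the convolutional alternative for contrast (the 64-filter conv layer is an assumed example, not from the text):&lt;/p&gt;

```python
dense_inputs = 200 * 200 * 3           # 120,000 input values
dense_weights = dense_inputs * 1000    # fully connected to 1,000 neurons

conv_weights = 5 * 5 * 3 * 64          # 64 shared 5x5x3 filters, reused everywhere

print(dense_weights)  # 120000000
print(conv_weights)   # 4800
```

&lt;p&gt;Four orders of magnitude fewer weights, because the same filters are applied at every position instead of being relearned per location.&lt;/p&gt;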

&lt;p&gt;So the issue is not just "too many parameters."&lt;/p&gt;

&lt;p&gt;The deeper issue is that a dense layer starts from the wrong assumption.&lt;/p&gt;

&lt;h1&gt;
  
  
  Images Have Structure, Not Just Size
&lt;/h1&gt;

&lt;p&gt;For tabular data, treating inputs as a feature vector is often fine.&lt;/p&gt;

&lt;p&gt;For images, it is not.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because image data has properties that matter directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nearby pixels are correlated,&lt;/li&gt;
&lt;li&gt;edges and textures are local,&lt;/li&gt;
&lt;li&gt;and object identity often survives small position changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cat is still a cat whether it appears slightly left or slightly right.&lt;/p&gt;

&lt;p&gt;A model for images should reflect that.&lt;/p&gt;

&lt;h1&gt;
  
  
  The CNN Idea
&lt;/h1&gt;

&lt;p&gt;CNNs solve this by injecting a useful inductive bias.&lt;/p&gt;

&lt;p&gt;Instead of saying, "learn everything from scratch," CNNs say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local patterns matter,&lt;/li&gt;
&lt;li&gt;the same pattern can appear anywhere,&lt;/li&gt;
&lt;li&gt;and spatial layout should be preserved long enough to build higher-level features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one architectural choice changes both efficiency and generalization.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Local Connectivity
&lt;/h1&gt;

&lt;p&gt;In a dense layer, each neuron connects to the full input.&lt;/p&gt;

&lt;p&gt;In a convolutional layer, each neuron looks at a small local patch.&lt;/p&gt;

&lt;p&gt;That patch is the receptive field.&lt;/p&gt;

&lt;p&gt;This makes sense for images because most meaningful low-level features are local:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;edges,&lt;/li&gt;
&lt;li&gt;corners,&lt;/li&gt;
&lt;li&gt;texture fragments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need the whole image to detect a vertical edge in one region.&lt;/p&gt;

&lt;p&gt;From an engineering perspective, local connectivity dramatically cuts parameter count.&lt;br&gt;
From a modeling perspective, it aligns the network with the structure of visual data.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Weight Sharing
&lt;/h1&gt;

&lt;p&gt;This is the design principle that makes convolution feel elegant.&lt;/p&gt;

&lt;p&gt;A CNN does not learn a separate edge detector for every location in the image.&lt;/p&gt;

&lt;p&gt;It learns one filter and applies it across locations.&lt;/p&gt;

&lt;p&gt;That means the same detector can fire on the left side, center, or right side of the input.&lt;/p&gt;

&lt;p&gt;This gives us two big wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer parameters
&lt;/h2&gt;

&lt;p&gt;Instead of learning duplicated weights for similar patterns at many positions, the model reuses the same filter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistent feature detection
&lt;/h2&gt;

&lt;p&gt;If the input shifts, the activation pattern shifts consistently.&lt;/p&gt;

&lt;p&gt;That is translation equivariance.&lt;/p&gt;

&lt;p&gt;A simple intuition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;move the edge in the input,&lt;/li&gt;
&lt;li&gt;and the edge response moves in the feature map.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For early visual processing, that is exactly what we want.&lt;/p&gt;
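&lt;p&gt;That behavior can be checked with a toy 1-D convolution: shift the input, and the response shifts with it (a minimal sketch with a made-up edge kernel):&lt;/p&gt;

```python
def conv1d_valid(signal, kernel):
    # Slide the kernel over the signal (stride 1, no padding).
    k = len(kernel)
    n = len(signal) - k + 1
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n)]

edge_kernel = [-1, 1]            # made-up edge detector
x = [0, 0, 0, 5, 5, 5, 0, 0]     # a step edge
x_shifted = [0] + x[:-1]         # same signal, moved one step right

print(conv1d_valid(x, edge_kernel))          # [0, 0, 5, 0, 0, -5, 0]
print(conv1d_valid(x_shifted, edge_kernel))  # [0, 0, 0, 5, 0, 0, -5]
```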

&lt;h1&gt;
  
  
  3. Pooling
&lt;/h1&gt;

&lt;p&gt;Pooling is often introduced as a downsampling step, and that is true, but it is more useful to think of it as controlled compression.&lt;/p&gt;

&lt;p&gt;It reduces the size of feature maps while preserving the strongest or most representative signals.&lt;/p&gt;

&lt;p&gt;Common examples are max pooling and average pooling.&lt;/p&gt;

&lt;p&gt;Why is that useful?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;later layers become cheaper,&lt;/li&gt;
&lt;li&gt;small local changes matter less,&lt;/li&gt;
&lt;li&gt;and the network becomes more robust to noise or slight shifts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A subtle but important point: pooling does not create perfect invariance by itself.&lt;/p&gt;

&lt;p&gt;What it really gives is robustness to minor local variation.&lt;/p&gt;

&lt;p&gt;That is a better mental model.&lt;/p&gt;
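&lt;p&gt;A tiny 1-D example of that robustness (toy numbers; the shift stays within each pooling window, which is exactly the “minor local variation” case):&lt;/p&gt;

```python
def max_pool(xs, size=2):
    # Non-overlapping 1-D max pooling.
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

acts = [9, 0, 0, 0, 3, 0]
acts_shifted = [0, 9, 0, 0, 0, 3]   # everything moved one step right

print(max_pool(acts))          # [9, 0, 3]
print(max_pool(acts_shifted))  # [9, 0, 3]  same summary despite the shift
```

&lt;p&gt;A larger shift that crosses window boundaries would change the output, which is why this is robustness, not invariance.&lt;/p&gt;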

&lt;h1&gt;
  
  
  Why CNNs Usually Generalize Better
&lt;/h1&gt;

&lt;p&gt;CNNs are not just smaller versions of dense networks.&lt;/p&gt;

&lt;p&gt;They are structured models.&lt;/p&gt;

&lt;p&gt;That matters because generalization improves when the architecture matches the data domain.&lt;/p&gt;

&lt;p&gt;CNNs help by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reducing unnecessary degrees of freedom,&lt;/li&gt;
&lt;li&gt;forcing local pattern learning,&lt;/li&gt;
&lt;li&gt;reusing filters across space,&lt;/li&gt;
&lt;li&gt;and preserving spatial organization through feature maps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when people say CNNs are efficient, they do not just mean "faster."&lt;/p&gt;

&lt;p&gt;They mean the model wastes less capacity on unrealistic hypotheses.&lt;/p&gt;

&lt;h1&gt;
  
  
  Feature Maps and Hierarchical Learning
&lt;/h1&gt;

&lt;p&gt;One of the nicest ways to understand CNNs is to think in terms of feature maps.&lt;/p&gt;

&lt;p&gt;A filter scans the image and produces a map showing where that filter’s learned pattern appears.&lt;/p&gt;

&lt;p&gt;Early filters often learn things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;horizontal edges,&lt;/li&gt;
&lt;li&gt;vertical edges,&lt;/li&gt;
&lt;li&gt;simple textures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deeper layers then combine those into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;contours,&lt;/li&gt;
&lt;li&gt;repeated motifs,&lt;/li&gt;
&lt;li&gt;parts of objects,&lt;/li&gt;
&lt;li&gt;object-level patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is hierarchical representation learning.&lt;/p&gt;

&lt;p&gt;In practice, CNNs move from "small visual primitives" to "larger semantic concepts."&lt;/p&gt;

&lt;p&gt;That is why deep convolutional networks became so effective in computer vision.&lt;/p&gt;

&lt;h1&gt;
  
  
  A Useful Comparison: MLP vs CNN
&lt;/h1&gt;

&lt;p&gt;Here is the cleanest mental contrast.&lt;/p&gt;

&lt;h2&gt;
  
  
  MLP
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;treats input as a flat vector,&lt;/li&gt;
&lt;li&gt;uses dense connectivity,&lt;/li&gt;
&lt;li&gt;learns with few built-in assumptions about image structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CNN
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;treats input as spatial data,&lt;/li&gt;
&lt;li&gt;uses local connectivity,&lt;/li&gt;
&lt;li&gt;shares weights across locations,&lt;/li&gt;
&lt;li&gt;builds layered feature hierarchies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the difference is not just architecture style.&lt;/p&gt;

&lt;p&gt;It is a difference in how the model thinks the data is organized.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why the Architecture History Matters
&lt;/h1&gt;

&lt;p&gt;Once these core ideas were established, later CNN families mainly improved optimization, depth, and efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AlexNet showed deep CNNs could dominate large-scale image recognition.&lt;/li&gt;
&lt;li&gt;VGG showed that stacking simple small convolutions could work extremely well.&lt;/li&gt;
&lt;li&gt;GoogLeNet improved efficiency and multi-scale processing.&lt;/li&gt;
&lt;li&gt;ResNet made very deep networks train reliably with skip connections.&lt;/li&gt;
&lt;li&gt;DenseNet pushed feature reuse even further.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different design, same foundation.&lt;/p&gt;

&lt;p&gt;All of them rely on the logic introduced above.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Real Lesson
&lt;/h1&gt;

&lt;p&gt;The most important thing to learn from CNNs is bigger than CNNs.&lt;/p&gt;

&lt;p&gt;Good model design is about matching architecture to data structure.&lt;/p&gt;

&lt;p&gt;For images, that means locality, repeated patterns, and spatial hierarchy.&lt;/p&gt;

&lt;p&gt;CNNs encode those assumptions directly.&lt;/p&gt;

&lt;p&gt;That is why they work.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Takeaway
&lt;/h1&gt;

&lt;p&gt;If you remember one line, remember this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;MLPs treat images like generic vectors. CNNs treat images like images.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the real reason convolution changed computer vision.&lt;/p&gt;

&lt;p&gt;What part of CNN design do you think mattered most historically: local connectivity, weight sharing, or later innovations like residual connections?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Image Classification Explained — Why k-NN Breaks and Linear Classifiers Matter</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:13:54 +0000</pubDate>
      <link>https://forem.com/zeromathai/image-classification-explained-why-k-nn-breaks-and-linear-classifiers-matter-106h</link>
      <guid>https://forem.com/zeromathai/image-classification-explained-why-k-nn-breaks-and-linear-classifiers-matter-106h</guid>
      <description>&lt;p&gt;Image classification sounds easy until you remember that a computer never sees “objects.” It only sees pixel arrays. This post explains why that makes k-NN a useful but limited baseline, and why linear classifiers are the point where real learning begins.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/image-classification-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/image-classification-en/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Start from the Actual Engineering Problem
&lt;/h2&gt;

&lt;p&gt;We usually describe image classification like this:&lt;/p&gt;

&lt;p&gt;input: image&lt;br&gt;&lt;br&gt;
output: label  &lt;/p&gt;

&lt;p&gt;That description is correct, but it hides the hard part.&lt;/p&gt;

&lt;p&gt;For a machine, an image is not “a cat” or “a truck.”&lt;br&gt;&lt;br&gt;
It is just something like:&lt;/p&gt;

&lt;p&gt;a 248 × 400 × 3 array: roughly 300,000 raw numbers&lt;/p&gt;

&lt;p&gt;So the real problem is:&lt;/p&gt;

&lt;p&gt;How do you map raw pixel values to a meaningful class?&lt;/p&gt;

&lt;p&gt;That question sits under a lot of computer vision work. Classification is the base layer. Object detection adds location. Segmentation adds per-pixel labeling. But the first wall you hit is still the same one: turning numeric arrays into semantic meaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Raw Pixels Are a Bad Starting Space
&lt;/h2&gt;

&lt;p&gt;Here is the simplest failure case.&lt;/p&gt;

&lt;p&gt;Take an image of a cat.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shift it 2 pixels to the right
&lt;/li&gt;
&lt;li&gt;slightly increase brightness
&lt;/li&gt;
&lt;li&gt;crop a small region
&lt;/li&gt;
&lt;li&gt;change the background
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To a human, it is still clearly a cat.&lt;/p&gt;

&lt;p&gt;To a model using raw pixel distance, it can look very different.&lt;/p&gt;

&lt;p&gt;This is the core issue:&lt;/p&gt;

&lt;p&gt;pixel space is not semantic space&lt;/p&gt;

&lt;p&gt;Two inputs can be far apart numerically but identical in meaning.&lt;br&gt;&lt;br&gt;
Two inputs can be close numerically but represent different objects.&lt;/p&gt;

&lt;p&gt;Real-world images make this worse due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;viewpoint changes
&lt;/li&gt;
&lt;li&gt;scale differences
&lt;/li&gt;
&lt;li&gt;deformation
&lt;/li&gt;
&lt;li&gt;occlusion
&lt;/li&gt;
&lt;li&gt;lighting variation
&lt;/li&gt;
&lt;li&gt;background clutter
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good model must ignore what does not matter and respond to what does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rule-Based Vision Fails
&lt;/h2&gt;

&lt;p&gt;A natural early idea is to define objects manually.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cats have ears
&lt;/li&gt;
&lt;li&gt;cats have whiskers
&lt;/li&gt;
&lt;li&gt;cats have certain shapes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This breaks quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ears may be hidden
&lt;/li&gt;
&lt;li&gt;lighting may remove edges
&lt;/li&gt;
&lt;li&gt;backgrounds may look similar
&lt;/li&gt;
&lt;li&gt;poses may distort shapes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rule-based vision fails because the visual world is too variable.&lt;/p&gt;

&lt;p&gt;This is why machine learning shifted to a data-driven approach:&lt;br&gt;
collect examples, learn patterns, and generalize instead of hardcoding rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Baseline: k-Nearest Neighbor (k-NN)
&lt;/h2&gt;

&lt;p&gt;The most intuitive classifier is k-NN.&lt;/p&gt;

&lt;p&gt;Idea:&lt;br&gt;
find similar images and reuse their labels&lt;/p&gt;

&lt;p&gt;Basic flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;store all training data
&lt;/li&gt;
&lt;li&gt;compute distance to each sample
&lt;/li&gt;
&lt;li&gt;pick top-k closest
&lt;/li&gt;
&lt;li&gt;vote
&lt;/li&gt;
&lt;/ol&gt;
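&lt;p&gt;The four steps above fit in a few lines of pure Python (a teaching sketch using squared Euclidean distance and made-up toy vectors; real pipelines vectorize this):&lt;/p&gt;

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (vector, label) pairs; query: vector.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # 1-2. compute distance to every stored sample, 3. keep the k closest
    neighbors = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    # 4. majority vote over their labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([0, 0], "cat"), ([1, 0], "cat"), ([9, 9], "truck"), ([8, 9], "truck")]
print(knn_predict(train, [1, 1], k=3))  # cat
```

&lt;p&gt;Note there is no training step at all: every prediction re-scans the stored data, which is exactly the scaling problem discussed below.&lt;/p&gt;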

&lt;p&gt;Why developers still use k-NN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple baseline
&lt;/li&gt;
&lt;li&gt;quick sanity check
&lt;/li&gt;
&lt;li&gt;useful for debugging datasets
&lt;/li&gt;
&lt;li&gt;exposes whether representation makes sense
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where k-NN Breaks
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Shift sensitivity&lt;br&gt;&lt;br&gt;
Small translations change pixel alignment everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lighting sensitivity&lt;br&gt;&lt;br&gt;
Brightness changes affect all pixels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flattening destroys structure&lt;br&gt;&lt;br&gt;
image → flatten → vector&lt;br&gt;&lt;br&gt;
You lose spatial relationships and locality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High-dimensional issues&lt;br&gt;&lt;br&gt;
Distances become less meaningful in high dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance problems&lt;br&gt;&lt;br&gt;
O(N) comparisons per prediction&lt;br&gt;&lt;br&gt;
High memory usage&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Core Insight
&lt;/h2&gt;

&lt;p&gt;k-NN does not learn.&lt;/p&gt;

&lt;p&gt;It memorizes the dataset and compares at test time.&lt;/p&gt;

&lt;p&gt;This is useful for intuition, but not scalable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Validation Still Matters
&lt;/h2&gt;

&lt;p&gt;Even with k-NN, you must choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;value of k
&lt;/li&gt;
&lt;li&gt;distance metric
&lt;/li&gt;
&lt;li&gt;preprocessing
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are hyperparameters.&lt;/p&gt;

&lt;p&gt;Validation or cross-validation helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compare configurations
&lt;/li&gt;
&lt;li&gt;avoid overfitting
&lt;/li&gt;
&lt;li&gt;select better setups
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern continues in all machine learning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift That Changes Everything
&lt;/h2&gt;

&lt;p&gt;To move forward, we stop asking:&lt;/p&gt;

&lt;p&gt;which stored images are closest?&lt;/p&gt;

&lt;p&gt;and start asking:&lt;/p&gt;

&lt;p&gt;can we learn a function that predicts directly?&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Classifier: Where Learning Begins
&lt;/h2&gt;

&lt;p&gt;A linear classifier computes:&lt;/p&gt;

&lt;p&gt;score = W × x + b&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;x is the input vector
&lt;/li&gt;
&lt;li&gt;W is the weight matrix
&lt;/li&gt;
&lt;li&gt;b is the bias
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does not need the full dataset at inference
&lt;/li&gt;
&lt;li&gt;computes predictions in constant time
&lt;/li&gt;
&lt;li&gt;learns parameters from data
&lt;/li&gt;
&lt;/ul&gt;
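&lt;p&gt;The score computation, written out for a tiny two-class case (all weights and inputs are made-up toy numbers, not a trained model):&lt;/p&gt;

```python
def scores(W, x, b):
    # One score per class: dot product of a weight row with x, plus a bias.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

W = [[0.2, -0.5],    # made-up weights for class 0 ("cat")
     [-0.1, 0.3]]    # made-up weights for class 1 ("truck")
b = [0.1, -0.2]
x = [1.0, 2.0]       # a (very) flattened pixel vector

s = scores(W, x, b)
predicted = max(range(len(s)), key=lambda i: s[i])  # argmax over classes
print(s, predicted)  # class 1 has the higher score
```

&lt;p&gt;Inference touches only W and b, never the training set: that is the constant-time, compact-memory win over k-NN.&lt;/p&gt;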

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;k-NN vs Linear Classifier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;similarity lookup vs learned function
&lt;/li&gt;
&lt;li&gt;no training vs parameter learning
&lt;/li&gt;
&lt;li&gt;slow inference vs fast inference
&lt;/li&gt;
&lt;li&gt;high memory vs compact model
&lt;/li&gt;
&lt;li&gt;weak generalization vs stronger generalization
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;p&gt;Not just performance.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;k-NN → similarity-based reasoning
&lt;/li&gt;
&lt;li&gt;linear model → learned representation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the moment where machine learning becomes actual learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Developers Should Care
&lt;/h2&gt;

&lt;p&gt;If you work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CNNs
&lt;/li&gt;
&lt;li&gt;vision models
&lt;/li&gt;
&lt;li&gt;deep learning systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;this is your foundation.&lt;/p&gt;

&lt;p&gt;Understanding this explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why raw pixels are not enough
&lt;/li&gt;
&lt;li&gt;why feature learning matters
&lt;/li&gt;
&lt;li&gt;why deep architectures exist
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Image classification is not just predicting labels.&lt;/p&gt;

&lt;p&gt;It is about turning unstable raw pixel inputs into stable semantic outputs.&lt;/p&gt;

&lt;p&gt;k-NN is a great teaching tool and debugging baseline.&lt;br&gt;&lt;br&gt;
But it shows exactly why we need something better.&lt;/p&gt;

&lt;p&gt;Linear classifiers matter because they introduce learning.&lt;/p&gt;

&lt;p&gt;And that is where modern computer vision really begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;Do you still use k-NN as a baseline or debugging tool?&lt;/p&gt;

&lt;p&gt;Or do you jump straight into learned models like CNNs?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>CNNs Explained: How Image Classification Actually Works in Deep Learning</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:10:19 +0000</pubDate>
      <link>https://forem.com/zeromathai/cnns-explained-how-image-classification-actually-works-in-deep-learning-2mbp</link>
      <guid>https://forem.com/zeromathai/cnns-explained-how-image-classification-actually-works-in-deep-learning-2mbp</guid>
      <description>&lt;p&gt;Understanding CNNs means understanding how models turn raw pixels into structured representations. This guide explains convolution, pooling, and architectures like ResNet with practical insights.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/dl-convolutional-neural-networks-cnn-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/dl-convolutional-neural-networks-cnn-en/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: Pixels → Meaning
&lt;/h2&gt;

&lt;p&gt;Images are just tensors.&lt;/p&gt;

&lt;p&gt;No objects. No semantics.&lt;/p&gt;

&lt;p&gt;So the real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we extract structure from raw data?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Old Pipelines Didn’t Scale
&lt;/h2&gt;

&lt;p&gt;Classic approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feature extraction (SIFT, HOG)&lt;/li&gt;
&lt;li&gt;Classifier (SVM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Limitation:&lt;/p&gt;

&lt;p&gt;You only learn what you design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MLPs Fail (Critical Insight)
&lt;/h2&gt;

&lt;p&gt;Flattening images destroys structure.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameter explosion&lt;/li&gt;
&lt;li&gt;No spatial awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the deeper issue:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No reuse of patterns&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  CNNs = Structured Efficiency
&lt;/h2&gt;

&lt;p&gt;CNNs fix this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local connectivity&lt;/li&gt;
&lt;li&gt;Weight sharing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer parameters&lt;/li&gt;
&lt;li&gt;Better generalization&lt;/li&gt;
&lt;li&gt;Built-in spatial bias&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Convolution Actually Learns
&lt;/h2&gt;

&lt;p&gt;Filters become detectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edges&lt;/li&gt;
&lt;li&gt;Textures&lt;/li&gt;
&lt;li&gt;Shapes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stacking layers creates hierarchy:&lt;/p&gt;

&lt;p&gt;Edges → shapes → objects&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Depth Matters (Practical View)
&lt;/h2&gt;

&lt;p&gt;Shallow model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects edges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deep model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understands objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depth = abstraction&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Components (What Actually Matters)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ReLU
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stabilizes gradients&lt;/li&gt;
&lt;li&gt;Enables deep learning&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Pooling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reduces noise&lt;/li&gt;
&lt;li&gt;Adds robustness&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Fully Connected
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Final decision layer&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why ResNet Changed Everything
&lt;/h2&gt;

&lt;p&gt;Deep networks used to fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem:
&lt;/h3&gt;

&lt;p&gt;Degradation with depth&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution:
&lt;/h3&gt;

&lt;p&gt;Skip connections&lt;/p&gt;




&lt;h3&gt;
  
  
  Real Effect:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easier training&lt;/li&gt;
&lt;li&gt;Deeper models&lt;/li&gt;
&lt;li&gt;Better results&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Training Insights (This Is Where Most Bugs Are)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Data Augmentation &amp;gt; Architecture (Often)
&lt;/h3&gt;

&lt;p&gt;Small dataset?&lt;/p&gt;

&lt;p&gt;→ augmentation matters more than model choice&lt;/p&gt;
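&lt;p&gt;The simplest concrete example of augmentation is a horizontal flip (a pure-Python sketch on a nested-list “image”; real code would use a library transform):&lt;/p&gt;

```python
def hflip(image):
    # Mirror each row: a flipped cat is still a cat, but the pixels differ.
    return [list(reversed(row)) for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
print(hflip(img))  # [[3, 2, 1], [6, 5, 4]]
```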




&lt;h3&gt;
  
  
  2. BatchNorm = Stability
&lt;/h3&gt;

&lt;p&gt;Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training unstable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster convergence&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Preprocessing Is Not Optional
&lt;/h3&gt;

&lt;p&gt;Unnormalized input = unstable gradients&lt;/p&gt;
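&lt;p&gt;The usual fix is to standardize inputs before training (a minimal sketch; in practice the mean and std must come from the training set only):&lt;/p&gt;

```python
def standardize(values):
    # Zero-mean, unit-variance scaling (assumes a non-constant input).
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

pixels = [0.0, 64.0, 128.0, 192.0, 255.0]
z = standardize(pixels)
print(z)  # centered around 0, unit variance
```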




&lt;h2&gt;
  
  
  Debugging CNNs (Highly Practical)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Feature Maps
&lt;/h3&gt;

&lt;p&gt;See what the model detects&lt;/p&gt;




&lt;h3&gt;
  
  
  CAM (Class Activation Map)
&lt;/h3&gt;

&lt;p&gt;See what the model uses&lt;/p&gt;




&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;Model classifies “cow” correctly.&lt;/p&gt;

&lt;p&gt;CAM shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on grass, not cow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dataset bias, not model intelligence&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Practical Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CNNs learn features automatically&lt;/li&gt;
&lt;li&gt;Structure matters more than size&lt;/li&gt;
&lt;li&gt;Depth builds meaning&lt;/li&gt;
&lt;li&gt;Training tricks are critical&lt;/li&gt;
&lt;li&gt;Visualization reveals hidden problems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;CNNs are not just models.&lt;/p&gt;

&lt;p&gt;They encode this idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learn representations, not rules&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you’ve worked with CNNs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did augmentation help more than architecture?&lt;/li&gt;
&lt;li&gt;Have you checked CAM for bias?&lt;/li&gt;
&lt;li&gt;Where did your model actually fail?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Neural Network Optimization Challenges — Fixing Vanishing Gradients with Better Architecture Design</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:08:01 +0000</pubDate>
      <link>https://forem.com/zeromathai/neural-network-optimization-challenges-fixing-vanishing-gradients-with-better-architecture-design-1gf5</link>
      <guid>https://forem.com/zeromathai/neural-network-optimization-challenges-fixing-vanishing-gradients-with-better-architecture-design-1gf5</guid>
      <description>&lt;p&gt;Vanishing gradients are one of the main reasons deep neural networks fail.&lt;/p&gt;

&lt;p&gt;If your deeper model performs worse than a shallow one, this is usually the cause.&lt;/p&gt;

&lt;p&gt;This post explains what’s happening—and how to fix it in practice.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/optimization-architecture-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/optimization-architecture-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  1. A Real Problem You’ve Probably Seen
&lt;/h1&gt;

&lt;p&gt;You build a deeper model expecting better performance.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training slows down&lt;/li&gt;
&lt;li&gt;loss stops improving&lt;/li&gt;
&lt;li&gt;accuracy gets worse than a smaller model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feels wrong.&lt;/p&gt;

&lt;p&gt;But it’s common.&lt;/p&gt;




&lt;h1&gt;
  
  
  2. The Root Cause: Gradient Flow Collapse
&lt;/h1&gt;

&lt;p&gt;Backpropagation sends gradients backward through layers.&lt;/p&gt;

&lt;p&gt;Each layer multiplies them.&lt;/p&gt;

&lt;p&gt;If those values are small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;they shrink exponentially&lt;/li&gt;
&lt;li&gt;eventually become ~0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;early layers stop learning&lt;/li&gt;
&lt;li&gt;model cannot improve&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  3. Why Sigmoid Breaks Deep Models
&lt;/h1&gt;

&lt;p&gt;Sigmoid looks mathematically clean.&lt;/p&gt;

&lt;p&gt;But in deep networks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outputs saturate&lt;/li&gt;
&lt;li&gt;derivatives become tiny&lt;/li&gt;
&lt;li&gt;gradients vanish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;σ(5) ≈ 0.993&lt;/li&gt;
&lt;li&gt;σ′(5) ≈ 0.007&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stack multiple layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(0.007)^10 → effectively zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why deep sigmoid networks fail.&lt;/p&gt;
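&lt;p&gt;You can verify the collapse in a few lines of plain Python — a toy sketch of ten stacked, saturated sigmoid layers, not a real network:&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

g = 1.0
for _ in range(10):          # ten saturated sigmoid layers
    g *= sigmoid_grad(5.0)   # each factor is about 0.007

# g is now about 1.7e-22 — effectively zero for the early layers
```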




&lt;h1&gt;
  
  
  4. The First Fix: ReLU
&lt;/h1&gt;

&lt;p&gt;ReLU avoids saturation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;f(x) = max(0, x)&lt;/li&gt;
&lt;li&gt;derivative = 1 (positive region)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Effect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gradients survive&lt;/li&gt;
&lt;li&gt;deeper models train&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Variants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leaky ReLU → avoids dead neurons&lt;/li&gt;
&lt;li&gt;GELU → smoother behavior (Transformers)&lt;/li&gt;
&lt;/ul&gt;
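&lt;p&gt;Repeat the ten-layer experiment with ReLU's derivative and the gradient survives untouched (a toy sketch, not how frameworks compute it):&lt;/p&gt;

```python
def relu_grad(x):
    # 1.0 where x is positive, 0.0 elsewhere: no saturation for active units
    return float(max(x, 0.0) != 0.0)

g = 1.0
for _ in range(10):        # ten active ReLU layers
    g *= relu_grad(5.0)

# g stays exactly 1.0 — the gradient is never shrunk by active units
```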




&lt;h1&gt;
  
  
  5. Depth vs Width (What to Actually Do)
&lt;/h1&gt;

&lt;p&gt;More depth is not always better.&lt;/p&gt;

&lt;p&gt;Deep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;expressive&lt;/li&gt;
&lt;li&gt;hard to train&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable&lt;/li&gt;
&lt;li&gt;less hierarchical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If training fails:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;try adjusting structure, not just size.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  6. Skip Connections (Why ResNet Works)
&lt;/h1&gt;

&lt;p&gt;Skip connections add a shortcut:&lt;/p&gt;

&lt;p&gt;x → F(x) + x&lt;/p&gt;

&lt;p&gt;This allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gradients to bypass layers&lt;/li&gt;
&lt;li&gt;signal strength to remain intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deep networks degrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deep networks train reliably&lt;/li&gt;
&lt;/ul&gt;
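&lt;p&gt;A quick numeric check of why the shortcut helps: the derivative of F(x) + x is F′(x) + 1, so the identity path contributes a constant 1 and the total gradient cannot vanish with F. The tiny-slope block below is hypothetical:&lt;/p&gt;

```python
def numeric_grad(fn, x, h=1e-6):
    """Central-difference estimate of the derivative at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

f = lambda x: 0.001 * x                    # a block contributing almost no gradient

g_plain = numeric_grad(f, 2.0)             # about 0.001
g_resid = numeric_grad(lambda x: f(x) + x, 2.0)  # about 1.001 — identity path dominates
```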




&lt;h1&gt;
  
  
  7. Architecture = Optimization Strategy
&lt;/h1&gt;

&lt;p&gt;Most people try to fix training with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning rate tweaks&lt;/li&gt;
&lt;li&gt;optimizer changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real fix is often:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;architecture&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;activation → controls gradients&lt;/li&gt;
&lt;li&gt;depth → increases optimization difficulty&lt;/li&gt;
&lt;li&gt;skip connections → fix gradient flow&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8. Practical Debug Scenario
&lt;/h1&gt;

&lt;p&gt;If your model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gets worse when deeper&lt;/li&gt;
&lt;li&gt;shows near-zero gradients early&lt;/li&gt;
&lt;li&gt;trains very slowly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;switch to ReLU/GELU&lt;/li&gt;
&lt;li&gt;add skip connections&lt;/li&gt;
&lt;li&gt;reconsider architecture&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  9. Key Insight
&lt;/h1&gt;

&lt;p&gt;If a deeper model performs worse than a shallow one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;suspect optimization before capacity.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Final Thought
&lt;/h1&gt;

&lt;p&gt;Deep learning is not about stacking layers.&lt;/p&gt;

&lt;p&gt;It’s about preserving learning signals.&lt;/p&gt;

&lt;p&gt;No gradient → no learning&lt;br&gt;&lt;br&gt;
Stable gradient → scalable models  &lt;/p&gt;




&lt;p&gt;What worked for you?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture changes?&lt;/li&gt;
&lt;li&gt;activation tweaks?&lt;/li&gt;
&lt;li&gt;training tricks?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious to hear real experiences.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Neural Networks Actually Learn: Backpropagation, Gradients, and Training Loop (Developer Guide)</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 18:04:48 +0000</pubDate>
      <link>https://forem.com/zeromathai/how-neural-networks-actually-learn-backpropagation-gradients-and-training-loop-developer-guide-39p8</link>
      <guid>https://forem.com/zeromathai/how-neural-networks-actually-learn-backpropagation-gradients-and-training-loop-developer-guide-39p8</guid>
      <description>&lt;p&gt;Learn how neural networks train using forward propagation, loss functions, and backpropagation. This developer-focused guide explains gradients, chain rule, and autograd with practical intuition.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/training-signals-back-fundamentals-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/training-signals-back-fundamentals-en/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Mechanism
&lt;/h2&gt;

&lt;p&gt;Neural networks don’t “learn” in a human sense.&lt;/p&gt;

&lt;p&gt;They optimize.&lt;/p&gt;

&lt;p&gt;Every step is:&lt;/p&gt;

&lt;p&gt;forward → loss → backward → update&lt;/p&gt;
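&lt;p&gt;That loop fits in a few lines for a one-parameter model — a toy sketch with y = w·x, squared-error loss, and the gradient written out by hand:&lt;/p&gt;

```python
w = 0.0                             # the single learnable parameter
lr = 0.1                            # learning rate
x, target = 1.0, 3.0

for _ in range(50):
    y = w * x                       # forward
    loss = (y - target) ** 2        # loss
    grad_w = 2 * (y - target) * x   # backward (chain rule by hand)
    w = w - lr * grad_w             # update

# w has converged to roughly 3.0, the value that zeroes the loss
```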




&lt;h2&gt;
  
  
  Training vs Inference
&lt;/h2&gt;

&lt;p&gt;Training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute loss&lt;/li&gt;
&lt;li&gt;run backward()&lt;/li&gt;
&lt;li&gt;update weights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forward only&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;No backward pass = no learning&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Two Signals
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Forward → prediction
&lt;/li&gt;
&lt;li&gt;Backward → gradients
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forward = what happened
&lt;/li&gt;
&lt;li&gt;backward = how to fix it
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Loss Function (Why Errors Matter)
&lt;/h2&gt;

&lt;p&gt;Binary Cross-Entropy example (true label = 1):&lt;/p&gt;

&lt;p&gt;ŷ = 0.8 → loss ≈ 0.223&lt;br&gt;&lt;br&gt;
ŷ = 0.1 → loss ≈ 2.302  &lt;/p&gt;

&lt;p&gt;Key idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wrong predictions create stronger gradients&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gradients = Direction
&lt;/h2&gt;

&lt;p&gt;gradient = ∂loss / ∂parameter  &lt;/p&gt;

&lt;p&gt;This tells us how to update weights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chain Rule (Core)
&lt;/h2&gt;

&lt;p&gt;y = f(g(h(x)))  &lt;/p&gt;

&lt;p&gt;dL/dx = dL/dy · dy/dg · dg/dh · dh/dx  &lt;/p&gt;




&lt;h2&gt;
  
  
  The Only Rule You Need
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;gradient = upstream × local derivative&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Example: Multiplication
&lt;/h3&gt;

&lt;p&gt;z = x * y  &lt;/p&gt;

&lt;p&gt;dL/dx = dL/dz * y&lt;br&gt;&lt;br&gt;
dL/dy = dL/dz * x  &lt;/p&gt;
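&lt;p&gt;You can sanity-check this rule against a numeric derivative — a toy example with L = z², so the upstream gradient dL/dz is 2z:&lt;/p&gt;

```python
def numeric_grad(fn, x, h=1e-6):
    """Central-difference estimate of the derivative at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x, y = 3.0, 4.0
z = x * y
upstream = 2 * z                 # dL/dz for L = z**2
dx = upstream * y                # the rule: gradient = upstream * local derivative

check = numeric_grad(lambda v: (v * y) ** 2, x)
# dx and check agree: both 96.0
```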




&lt;h3&gt;
  
  
  Example: Square
&lt;/h3&gt;

&lt;p&gt;out = x²  &lt;/p&gt;

&lt;p&gt;grad = upstream * 2x  &lt;/p&gt;




&lt;h2&gt;
  
  
  Why Autograd Exists
&lt;/h2&gt;

&lt;p&gt;Manual chain rule doesn’t scale.&lt;/p&gt;

&lt;p&gt;Frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store forward values
&lt;/li&gt;
&lt;li&gt;build computation graph
&lt;/li&gt;
&lt;li&gt;apply backward automatically
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Happens in Code
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred = model(x)
loss = criterion(y_pred, y)

loss.backward()
optimizer.step()
optimizer.zero_grad()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Important Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why zero_grad()?
&lt;/h3&gt;

&lt;p&gt;Gradients accumulate by default.&lt;/p&gt;

&lt;p&gt;Without reset:&lt;/p&gt;

&lt;p&gt;grad_total = grad_step1 + grad_step2 + ...&lt;/p&gt;




&lt;h3&gt;
  
  
  Why backward() first?
&lt;/h3&gt;

&lt;p&gt;Because gradients must exist before updating:&lt;/p&gt;

&lt;p&gt;loss.backward() → gradients computed&lt;br&gt;&lt;br&gt;
optimizer.step() → parameters updated  &lt;/p&gt;




&lt;h3&gt;
  
  
  Why reverse traversal?
&lt;/h3&gt;

&lt;p&gt;Because gradients depend on outputs.&lt;/p&gt;

&lt;p&gt;So computation flows:&lt;/p&gt;

&lt;p&gt;output → input  &lt;/p&gt;




&lt;h2&gt;
  
  
  Computational Graph Intuition
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;forward = build graph
&lt;/li&gt;
&lt;li&gt;backward = traverse graph in reverse
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Intermediate results are reused.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gradient Descent
&lt;/h2&gt;

&lt;p&gt;θ = θ − η ∇L  &lt;/p&gt;

&lt;p&gt;η = learning rate  &lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Neural networks learn by:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;propagating error backward and updating parameters  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the entire system.&lt;/p&gt;




&lt;p&gt;What helped you understand backprop the most—math, visualization, or code? Let’s discuss 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Output Layer Explained — Logits, Softmax, Cross-Entropy, and Why They Work Together</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 17:59:09 +0000</pubDate>
      <link>https://forem.com/zeromathai/output-layer-explained-logits-softmax-cross-entropy-and-why-they-work-together-17al</link>
      <guid>https://forem.com/zeromathai/output-layer-explained-logits-softmax-cross-entropy-and-why-they-work-together-17al</guid>
      <description>&lt;p&gt;Neural networks don’t output decisions — they output probabilities.&lt;/p&gt;

&lt;p&gt;This post explains how logits, softmax, and cross-entropy turn raw outputs into meaningful predictions in deep learning.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/output-layer-probabilistic-interpretation-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/output-layer-probabilistic-interpretation-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  The Real Role of the Output Layer
&lt;/h1&gt;

&lt;p&gt;A neural network doesn’t directly say:&lt;/p&gt;

&lt;p&gt;“This is class A.”&lt;/p&gt;

&lt;p&gt;Instead, it computes:&lt;/p&gt;

&lt;p&gt;A probability distribution over all classes.&lt;/p&gt;




&lt;h1&gt;
  
  
  Step 1 — Logits (Raw Scores)
&lt;/h1&gt;

&lt;p&gt;Final layer:&lt;/p&gt;

&lt;p&gt;z = Wh + b&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;z = logits&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not probabilities&lt;/li&gt;
&lt;li&gt;Not normalized&lt;/li&gt;
&lt;li&gt;Can be negative or large&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;[0.4, -1.7, 4.2]&lt;/p&gt;




&lt;h1&gt;
  
  
  Step 2 — Softmax (Make It Probabilistic)
&lt;/h1&gt;

&lt;p&gt;softmax(z_i) = exp(z_i) / Σ exp(z_j)&lt;/p&gt;

&lt;p&gt;Transforms:&lt;/p&gt;

&lt;p&gt;[0.4, -1.7, 4.2]&lt;br&gt;&lt;br&gt;
→ [0.022, 0.003, 0.975]&lt;/p&gt;

&lt;p&gt;Now outputs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;positive&lt;/li&gt;
&lt;li&gt;sum to 1&lt;/li&gt;
&lt;li&gt;interpretable&lt;/li&gt;
&lt;/ul&gt;
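&lt;p&gt;A few lines of plain Python reproduce those numbers (with the standard max-subtraction for stability):&lt;/p&gt;

```python
import math

def softmax(z):
    """Turn raw logits into a probability distribution."""
    m = max(z)                                # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.4, -1.7, 4.2])
# probs is approximately [0.022, 0.003, 0.975] and sums to 1
```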




&lt;h1&gt;
  
  
  Step 3 — Argmax (Decision)
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Softmax → probabilities&lt;/li&gt;
&lt;li&gt;Argmax → final class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important:&lt;/p&gt;

&lt;p&gt;Softmax keeps uncertainty&lt;br&gt;&lt;br&gt;
Argmax removes it&lt;/p&gt;




&lt;h1&gt;
  
  
  Step 4 — Training Uses Cross-Entropy
&lt;/h1&gt;

&lt;p&gt;Loss:&lt;/p&gt;

&lt;p&gt;− log(p_true_class)&lt;/p&gt;

&lt;p&gt;Why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;differentiable&lt;/li&gt;
&lt;li&gt;punishes confident mistakes&lt;/li&gt;
&lt;li&gt;aligns with probability theory&lt;/li&gt;
&lt;/ul&gt;
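&lt;p&gt;In code the loss is literally one line — and it shows why confident mistakes hurt:&lt;/p&gt;

```python
import math

def cross_entropy(p_true):
    """Loss for the probability the model assigned to the correct class."""
    return -math.log(p_true)

confident_right = cross_entropy(0.975)   # about 0.025 — tiny penalty
confident_wrong = cross_entropy(0.01)    # about 4.6 — huge penalty
```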




&lt;h1&gt;
  
  
  Step 5 — Why Frameworks Use Logits Directly
&lt;/h1&gt;

&lt;p&gt;In PyTorch / TensorFlow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CrossEntropyLoss expects logits&lt;/li&gt;
&lt;li&gt;NOT softmax output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Numerical stability:&lt;/p&gt;

&lt;p&gt;log(softmax(z)) is computed safely without overflow&lt;/p&gt;
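&lt;p&gt;A quick demonstration of the stability problem and the usual fix, shifting by the max before exponentiating (the log-sum-exp trick):&lt;/p&gt;

```python
import math

z = [1000.0, 1001.0, 999.0]               # large but plausible logits

# naive softmax overflows: math.exp(1000.0) raises OverflowError
try:
    naive = [math.exp(v) for v in z]
except OverflowError:
    naive = None

# stable log-softmax: shift by the max first, nothing overflows
m = max(z)
log_probs = [v - m - math.log(sum(math.exp(u - m) for u in z)) for v in z]
```

&lt;p&gt;This is why loss functions take logits and do the log and softmax in one fused, safe step.&lt;/p&gt;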




&lt;h1&gt;
  
  
  Step 6 — Softmax vs Sigmoid (Real-World Bug Source)
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Binary → sigmoid&lt;/li&gt;
&lt;li&gt;Multi-class → softmax&lt;/li&gt;
&lt;li&gt;Multi-label → sigmoid per class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common bug:&lt;/p&gt;

&lt;p&gt;Using softmax for multi-label → wrong behavior&lt;/p&gt;




&lt;h1&gt;
  
  
  Step 7 — Inference Tip
&lt;/h1&gt;

&lt;p&gt;Do you always need softmax?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For prediction only → argmax(logits) works&lt;/li&gt;
&lt;li&gt;For probabilities → apply softmax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This saves computation in production systems.&lt;/p&gt;




&lt;h1&gt;
  
  
  Mental Model
&lt;/h1&gt;

&lt;p&gt;Input → Features → Logits → Softmax → Probabilities → Argmax → Prediction&lt;/p&gt;




&lt;h1&gt;
  
  
  Debugging Checklist
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Overconfident wrong → calibration issue&lt;/li&gt;
&lt;li&gt;Always low confidence → weak features&lt;/li&gt;
&lt;li&gt;Loss not decreasing → output/loss mismatch&lt;/li&gt;
&lt;li&gt;Multi-label broken → wrong activation&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Takeaway
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Output layer → scores&lt;/li&gt;
&lt;li&gt;Softmax → probabilities&lt;/li&gt;
&lt;li&gt;Argmax → decisions&lt;/li&gt;
&lt;li&gt;Cross-entropy → learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deep learning works because:&lt;/p&gt;

&lt;p&gt;It models uncertainty, not just outputs.&lt;/p&gt;




&lt;p&gt;Where do you usually get stuck — logits, softmax, or loss functions?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Multilayer Perceptron (MLP): A Practical Way to Understand Neural Networks</title>
      <dc:creator>shangkyu shin</dc:creator>
      <pubDate>Sat, 11 Apr 2026 17:37:29 +0000</pubDate>
      <link>https://forem.com/zeromathai/multilayer-perceptron-mlp-a-practical-way-to-understand-neural-networks-3hic</link>
      <guid>https://forem.com/zeromathai/multilayer-perceptron-mlp-a-practical-way-to-understand-neural-networks-3hic</guid>
      <description>&lt;p&gt;Multilayer Perceptrons (MLPs) are the foundation of deep learning. This guide explains MLP intuition, real-world usage, and when you should (and shouldn’t) use it.&lt;/p&gt;

&lt;p&gt;Cross-posted from Zeromath. Original article: &lt;a href="https://zeromathai.com/en/mlp-intuition-components-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/mlp-intuition-components-en/&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  MLP = A Function (Not Layers)
&lt;/h1&gt;

&lt;p&gt;Most people think neural networks are stacks of layers.&lt;/p&gt;

&lt;p&gt;They are wrong.&lt;/p&gt;

&lt;p&gt;An MLP is:&lt;/p&gt;

&lt;p&gt;y = f(x; θ)&lt;/p&gt;

&lt;p&gt;👉 A learnable function.&lt;/p&gt;




&lt;h1&gt;
  
  
  Start Simple
&lt;/h1&gt;

&lt;p&gt;z = wᵀx + b&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;works for simple problems
&lt;/li&gt;
&lt;li&gt;fails for nonlinear patterns
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Add Nonlinearity → Neural Network
&lt;/h1&gt;

&lt;p&gt;a = σ(wᵀx + b)&lt;/p&gt;

&lt;p&gt;Now you can model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nonlinear relationships
&lt;/li&gt;
&lt;li&gt;feature interactions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is where deep learning starts.&lt;/p&gt;




&lt;h1&gt;
  
  
  Core Building Block
&lt;/h1&gt;

&lt;p&gt;Each neuron:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;linear transform
&lt;/li&gt;
&lt;li&gt;activation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stack them → model.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;x = (1, 2)&lt;br&gt;&lt;br&gt;
w = (0.5, -1)&lt;br&gt;&lt;br&gt;
b = 0.1  &lt;/p&gt;

&lt;p&gt;z = -1.4  &lt;/p&gt;

&lt;p&gt;Then activation decides output.&lt;/p&gt;
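&lt;p&gt;Checking the numbers in plain Python:&lt;/p&gt;

```python
import math

x = (1.0, 2.0)
w = (0.5, -1.0)
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # 0.5*1.0 - 1.0*2.0 + 0.1 = -1.4

sigmoid_out = 1.0 / (1.0 + math.exp(-z))       # about 0.198
relu_out = max(0.0, z)                         # 0.0 — the neuron is inactive
```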




&lt;h1&gt;
  
  
  Layers
&lt;/h1&gt;

&lt;p&gt;Each layer:&lt;/p&gt;

&lt;p&gt;x → Wx + b → activation  &lt;/p&gt;

&lt;p&gt;Stack:&lt;/p&gt;

&lt;p&gt;input → hidden → output  &lt;/p&gt;




&lt;h1&gt;
  
  
  Why Depth Works
&lt;/h1&gt;

&lt;p&gt;Instead of learning everything at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer 1 → simple features
&lt;/li&gt;
&lt;li&gt;Layer 2 → combinations
&lt;/li&gt;
&lt;li&gt;Layer 3 → abstractions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Deep learning = function composition&lt;/p&gt;




&lt;h1&gt;
  
  
  When to Use MLP (Real Use Cases)
&lt;/h1&gt;

&lt;p&gt;Use MLP when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tabular datasets (very common in industry)
&lt;/li&gt;
&lt;li&gt;structured features (e.g. finance, logs, metrics)
&lt;/li&gt;
&lt;li&gt;baseline model before complex architectures
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 In many real projects, MLP is the first model you try.&lt;/p&gt;




&lt;h1&gt;
  
  
  When NOT to Use MLP
&lt;/h1&gt;

&lt;p&gt;Avoid MLP when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;images → use CNN
&lt;/li&gt;
&lt;li&gt;sequences → use RNN / Transformer
&lt;/li&gt;
&lt;li&gt;structure matters
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 MLP has no built-in structural prior — it treats the input as a flat, unordered feature vector.&lt;/p&gt;




&lt;h1&gt;
  
  
  Practical Comparison
&lt;/h1&gt;

&lt;p&gt;MLP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;good for tabular data
&lt;/li&gt;
&lt;li&gt;assumes no structure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;good when nearby pixels matter
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;good when relationships matter globally
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Choose model based on data structure.&lt;/p&gt;




&lt;h1&gt;
  
  
  Minimal PyTorch Example
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),  # 10 input features
    nn.ReLU(),
    nn.Linear(32, 1)    # regression output
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
