<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hadasa E</title>
    <description>The latest articles on Forem by Hadasa E (@hadasa_73ea7c221359817306).</description>
    <link>https://forem.com/hadasa_73ea7c221359817306</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650321%2F36de08fb-582c-4dbe-810d-33a4c2d5beda.png</url>
      <title>Forem: Hadasa E</title>
      <link>https://forem.com/hadasa_73ea7c221359817306</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hadasa_73ea7c221359817306"/>
    <language>en</language>
    <item>
      <title>From a Basic U-Net to a Robust SEM Denoiser</title>
      <dc:creator>Hadasa E</dc:creator>
      <pubDate>Mon, 08 Dec 2025 13:53:53 +0000</pubDate>
      <link>https://forem.com/hadasa_73ea7c221359817306/from-a-basic-u-net-to-a-robust-sem-denoiser-3ie9</link>
      <guid>https://forem.com/hadasa_73ea7c221359817306/from-a-basic-u-net-to-a-robust-sem-denoiser-3ie9</guid>
      <description>&lt;p&gt;When I first approached the problem of denoising Scanning Electron Microscope (SEM) images, I assumed the key would be choosing the &lt;em&gt;right architecture&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
I thought in terms of layers, depth, and parameters.&lt;/p&gt;

&lt;p&gt;After some time of research and experiments, I realized the real question is different:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do you translate an engineer’s intuition into a mathematical objective?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words: &lt;strong&gt;how do I tell the model what must be preserved (tiny defects) and what it’s allowed to remove (scan noise)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’ve ever been asked to “just do denoising for SEM images” or to improve an existing model and the data + noise feel like a black box, this post is for you.&lt;br&gt;&lt;br&gt;
It walks through the process I went through – from a naïve baseline to a more robust system built around custom loss functions, which is now integrated into a full application.&lt;/p&gt;

&lt;p&gt;This work was done as part of a joint &lt;strong&gt;Applied Materials &amp;amp; Extra-Tech&lt;/strong&gt; bootcamp project.&lt;br&gt;
I’d like to thank our mentors &lt;strong&gt;Roman Kris&lt;/strong&gt; and &lt;strong&gt;Mor Baram&lt;/strong&gt; from Applied Materials for their guidance, technical insights, and thoughtful feedback throughout the project, and our instructors &lt;strong&gt;Shmuel Fine&lt;/strong&gt; and &lt;strong&gt;Sara Shimon&lt;/strong&gt; from Extra-Tech for their support and teaching.&lt;/p&gt;

&lt;p&gt;Together with the team, we built an end-to-end SEM denoising system – from datasets and classical denoising baselines to deep-learning models and a production-style application. In this post I’ll focus on the deep-learning denoisers I worked on: both designing and training the models, and integrating them into the backend API and desktop client.&lt;/p&gt;

&lt;p&gt;SEM-based inspection sits in the critical path of semiconductor manufacturing. If your denoiser removes real defects or leaves too much noise, you’re directly affecting yield, false alarms, and engineers’ trust in the system. That’s why “just denoising” SEM images isn’t a cosmetic enhancement — it’s a core part of the inspection pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  The challenge: when noise is part of the signal
&lt;/h2&gt;

&lt;p&gt;SEM (Scanning Electron Microscope) images create a very specific computer-vision challenge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low SNR&lt;/strong&gt; – the noise is not a “nice” Gaussian; it comes from the physics of the scan.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge sensitivity&lt;/strong&gt; – unlike natural images, in wafers every tiny line or texture change may indicate a critical defect in the manufacturing process.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trade-off&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Too much smoothing → you lose defects (false negatives).
&lt;/li&gt;
&lt;li&gt;Too little denoising → the remaining noise makes it hard for algorithms and engineers to detect real issues.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to build a model that &lt;strong&gt;removes noise aggressively enough&lt;/strong&gt; to make analysis easier, but &lt;strong&gt;protects fine structures and critical defects&lt;/strong&gt; as much as possible.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1 – Baseline U-Net and the limits of MSE
&lt;/h2&gt;

&lt;p&gt;I started with a fairly standard setup to get a benchmark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; vanilla U-Net.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;em&gt;Tin-balls&lt;/em&gt; – ~50 pairs of noisy + clean images, a small and relatively clean dataset.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss function:&lt;/strong&gt; MSE + SSIM:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Lbaseline=MSE(x,y)+λ⋅(1−SSIM(x,y))
\mathcal{L}_{\text{baseline}} = \mathrm{MSE}(x, y) + \lambda \cdot \bigl(1 - \mathrm{SSIM}(x, y)\bigr)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;baseline&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathrm"&gt;MSE&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;&lt;span class="delimsizing size1"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathrm"&gt;SSIM&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="delimsizing size1"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
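&lt;p&gt;As a sketch, this baseline objective in PyTorch might look like the following, using a simplified single-scale SSIM with a uniform local window (the λ weight and window size here are illustrative, not the project's exact values):&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor, window: int = 7,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Simplified single-scale SSIM with a uniform local window.

    x, y: (N, C, H, W) tensors with values in [0, 1].
    """
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    # Local (co)variances via E[ab] - E[a] * E[b]
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def baseline_loss(pred: torch.Tensor, target: torch.Tensor,
                  lam: float = 0.5) -> torch.Tensor:
    """L = MSE(pred, target) + lambda * (1 - SSIM(pred, target))."""
    return F.mse_loss(pred, target) + lam * (1.0 - ssim(pred, target))
```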


&lt;p&gt;The metrics looked “okay”, but visually something was off:&lt;br&gt;&lt;br&gt;
the model tended to &lt;strong&gt;smear sharp edges&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MSE (Mean Squared Error) heavily punishes large errors (outliers), which encourages the model to “average out” local extremes.&lt;br&gt;&lt;br&gt;
Instead of reconstructing the true texture, it prefers a slightly blurred version that is safer from the loss perspective.&lt;/p&gt;

&lt;p&gt;This baseline was still very important: it provided a &lt;strong&gt;stable reference point&lt;/strong&gt; to measure every later improvement against.&lt;/p&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dfox609ieyfpxe1p7ur.png" alt="Tin-balls SEM: noisy vs ground truth vs denoised"&gt;
    &lt;small&gt;&lt;em&gt;
      Source:
      &lt;a href="https://iopscience.iop.org/article/10.1088/1361-6501/ad7e41" rel="noopener noreferrer"&gt;
        https://iopscience.iop.org/article/10.1088/1361-6501/ad7e41
      &lt;/a&gt;
    &lt;/em&gt;&lt;/small&gt;
  
  


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F773an5lqcemyi1rl5a5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F773an5lqcemyi1rl5a5z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All experiments were implemented in PyTorch, with training and evaluation scripts sharing the same config-driven codebase.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 2 – Rethinking the problem: moving to Charbonnier loss
&lt;/h2&gt;

&lt;p&gt;Once the baseline worked “fine on paper” but clearly blurred edges and defects, I stopped trying to make the model bigger and asked a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Where exactly is the model wrong?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking at examples, difference maps and metrics, I noticed a recurring pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model reacted &lt;strong&gt;too aggressively to single noisy pixels&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It was willing to oversmooth entire regions just to reduce a few large local errors
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the problem was not only the model – it was &lt;strong&gt;how we defined “error”&lt;/strong&gt; in the loss.&lt;/p&gt;

&lt;p&gt;At this point I went looking for a loss function that would better match SEM noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less sensitive to outliers
&lt;/li&gt;
&lt;li&gt;Better at preserving textures and edges
&lt;/li&gt;
&lt;li&gt;Still smooth and differentiable for training deep networks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After comparing several options, &lt;strong&gt;Charbonnier loss&lt;/strong&gt; stood out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It behaves similarly to the L&lt;sub&gt;1&lt;/sub&gt; loss (more robust than MSE),
&lt;/li&gt;
&lt;li&gt;but is smooth and differentiable everywhere:&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Lchar(x,y)=(x−y)2+ϵ2
\mathcal{L}_{\text{char}}(x, y) = \sqrt{(x - y)^2 + \epsilon^2}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;char&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 
mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;ϵ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
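&lt;p&gt;In PyTorch this is only a few lines (ε = 1e-3 is a typical choice here, not necessarily the exact value we used):&lt;/p&gt;

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: sqrt((x - y)^2 + eps^2), averaged over all pixels.

    Behaves like L1 for large errors (robust to outliers), but is smooth
    and differentiable everywhere, including at zero error.
    """
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```

&lt;p&gt;Near zero error the gradient stays bounded (unlike plain L&lt;sub&gt;1&lt;/sub&gt;), which is what keeps training stable.&lt;/p&gt;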


&lt;p&gt;I re-trained the &lt;em&gt;same&lt;/em&gt; U-Net, on the &lt;em&gt;same&lt;/em&gt; Tin-balls dataset, with the &lt;em&gt;same&lt;/em&gt; training setup –&lt;br&gt;&lt;br&gt;
the only change was replacing the MSE term with Charbonnier.&lt;/p&gt;

&lt;p&gt;After this change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PSNR and SSIM improved consistently 

&lt;ul&gt;
&lt;li&gt;PSNR improved from 27.4 dB to 30.7 dB &lt;/li&gt;
&lt;li&gt;SSIM increased from 0.83 to 0.88&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Visually, edges looked more natural, with much less over-smoothing around defects
&lt;/li&gt;
&lt;/ul&gt;
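&lt;p&gt;For context, PSNR numbers like the ones above follow the standard definition (a minimal sketch; the full evaluation pipeline also computed SSIM and other metrics):&lt;/p&gt;

```python
import math
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2).item()
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```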

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn1mnywm74zpfubdy9n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn1mnywm74zpfubdy9n7.png" alt="Bar chart comparing baseline vs fine-tuned model (PSNR, SSIM, FSIM, CNR, UIQ)"&gt;&lt;/a&gt;&lt;br&gt;
You can see that the fine-tuned model consistently improves all metrics over the baseline, not just PSNR/SSIM.&lt;/p&gt;

&lt;p&gt;This was the &lt;strong&gt;first mindset shift&lt;/strong&gt; in the project:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of “make the model bigger”, start with &lt;strong&gt;designing the right objective&lt;/strong&gt; for the model to optimize.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Step 3 – Cracking wafers: adding structure and edge awareness
&lt;/h2&gt;

&lt;p&gt;When I moved to the more complex &lt;strong&gt;Wafers&lt;/strong&gt; dataset — roughly 1,000 wafer SEM image pairs with high-intensity, physically inspired synthetic noise added on top of clean references — the requirements changed again.&lt;/p&gt;

&lt;p&gt;Here, periodic patterns and tiny defects matter even more.&lt;/p&gt;

&lt;p&gt;To help the model “understand” that, I upgraded the loss with two additional components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MS-SSIM (Multi-Scale SSIM)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Instead of measuring similarity at a single resolution, MS-SSIM looks at multiple scales.&lt;br&gt;&lt;br&gt;
This helps the model preserve both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;global structure (macro), and
&lt;/li&gt;
&lt;li&gt;fine details (micro).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge-aware loss&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I added a term based on image gradients (using a Sobel operator) that penalizes the model when it &lt;strong&gt;breaks or smears edges&lt;/strong&gt; that exist in the input image.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full loss became:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Ltotal=Lchar+α⋅LMS-SSIM+β⋅Ledge
\mathcal{L}{\text{total}} =
\mathcal{L}{\text{char}} +
\alpha \cdot \mathcal{L}{\text{MS-SSIM}} +
\beta \cdot \mathcal{L}{\text{edge}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;total&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;char&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;α&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;MS-SSIM&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;edge&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
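&lt;p&gt;A sketch of how these pieces combine: the edge-aware term compares Sobel gradient maps, and the Charbonnier and MS-SSIM terms are passed in as callables to keep the example short (α and β are placeholders here, not our tuned weights):&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

# Fixed Sobel kernels for horizontal / vertical image gradients
_SOBEL_X = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def sobel_gradients(img: torch.Tensor) -> torch.Tensor:
    """Stacked gx, gy gradient maps for a (N, 1, H, W) image."""
    gx = F.conv2d(img, _SOBEL_X.to(img), padding=1)
    gy = F.conv2d(img, _SOBEL_Y.to(img), padding=1)
    return torch.cat([gx, gy], dim=1)

def edge_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between the Sobel gradient maps of prediction and target."""
    return F.l1_loss(sobel_gradients(pred), sobel_gradients(target))

def total_loss(pred, target, charbonnier, ms_ssim,
               alpha: float = 0.2, beta: float = 0.1) -> torch.Tensor:
    """L_total = L_char + alpha * (1 - MS-SSIM) + beta * L_edge.

    `charbonnier` and `ms_ssim` are callables, so this sketch stays
    self-contained; alpha and beta are illustrative, not tuned values.
    """
    return (charbonnier(pred, target)
            + alpha * (1.0 - ms_ssim(pred, target))
            + beta * edge_loss(pred, target))
```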


&lt;p&gt;The result was a clear jump in quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defects stayed &lt;strong&gt;sharp and visible&lt;/strong&gt; on a clean background
&lt;/li&gt;
&lt;li&gt;The model outperformed classical denoising methods such as BM3D or bilateral filters
&lt;/li&gt;
&lt;li&gt;Both metrics and visual inspection aligned much better with what domain experts expected to see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftejtm8f16c6zdsgljbe6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftejtm8f16c6zdsgljbe6.png" alt="Wafer SEM: noisy vs denoised result"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What didn’t work (and what I learned from it)
&lt;/h2&gt;

&lt;p&gt;A big part of this project was &lt;strong&gt;testing ideas that &lt;em&gt;didn’t&lt;/em&gt; make it into production&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
They were still valuable – each one refined my understanding of the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Residual prediction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; predict the &lt;strong&gt;noise&lt;/strong&gt; (noisy − clean) instead of the clean image itself.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Motivation:&lt;/strong&gt; the model might find it easier to focus purely on the noise component and avoid touching edges and defects. This is common in many denoising architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice&lt;/strong&gt;, with SEM noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training became less stable
&lt;/li&gt;
&lt;li&gt;I saw visual artifacts, especially around edges and defect regions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; residual learning is not automatically a win – especially when the noise structure is complex and ground truth is limited.&lt;/p&gt;
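&lt;p&gt;For reference, the residual formulation is just a thin wrapper around any backbone (a single convolution stands in for the U-Net in this sketch):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Predicts the noise component; output = noisy input - predicted noise."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # any image-to-image network, e.g. a U-Net

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        noise = self.backbone(noisy)  # the model learns the residual
        return noisy - noise          # reconstruct the clean estimate

# A single conv layer stands in for the U-Net backbone in this sketch
model = ResidualDenoiser(nn.Conv2d(1, 1, kernel_size=3, padding=1))
out = model(torch.rand(2, 1, 32, 32))
```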




&lt;h3&gt;
  
  
  Deeper U-Net
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; increase the model capacity (more depth) to better capture complex patterns and spatially varying noise, especially on wafers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expectation:&lt;/strong&gt; better performance on the harder datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training and inference time increased significantly
&lt;/li&gt;
&lt;li&gt;No clear, consistent improvement in metrics or visual quality
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; &lt;strong&gt;more capacity is not always more value&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
With a well-designed loss, the original architecture was already “good enough”, and adding depth didn’t justify the extra cost.&lt;/p&gt;




&lt;h3&gt;
  
  
  Edge map as an extra input channel
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; explicitly provide an edge map (e.g., Sobel) as an additional input channel, hoping the model would pay more attention to boundaries and defects from the first layer.&lt;/p&gt;

&lt;p&gt;Once I switched to a loss that already included an edge-aware term, this turned out to be unnecessary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model learned to emphasize edges directly from the raw image
&lt;/li&gt;
&lt;li&gt;There was no real gain in metrics or visual quality
&lt;/li&gt;
&lt;li&gt;The input pipeline became more complex for no benefit
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reinforced the idea that it’s often better to &lt;strong&gt;encode our priorities in the loss&lt;/strong&gt;, not just in the inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking against classical methods
&lt;/h2&gt;

&lt;p&gt;To validate the approach, I compared the U-Net model against established classical denoisers on the same test set:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvau2c8aftbmdfyev8w24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvau2c8aftbmdfyev8w24.png" alt="Benchmark table: classical methods vs U-Net"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The deep learning model clearly outperforms the classical methods in terms of quality metrics.&lt;/p&gt;

&lt;p&gt;However, the story is more nuanced when you factor in &lt;strong&gt;speed and deployment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classical denoisers&lt;/strong&gt; are simple to deploy, run on CPU, and require no training data or GPUs, which makes them attractive for quick experiments or low-resource environments.&lt;/li&gt;
&lt;li&gt;At the same time, they are typically tuned to relatively simple noise models and often struggle with the complex, structured noise in SEM images, forcing a compromise between over-smoothing and under-denoising.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our U-Net&lt;/strong&gt; offers the best quality, but requires:

&lt;ul&gt;
&lt;li&gt;Initial training time and labeled noisy/clean pairs&lt;/li&gt;
&lt;li&gt;A GPU for real-time processing (or runs ~2–3× slower on CPU)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
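&lt;p&gt;For intuition, here is the kind of classical baseline we compared against – a naive, loop-based bilateral filter in NumPy (production code would use an optimized implementation such as OpenCV's &lt;code&gt;cv2.bilateralFilter&lt;/code&gt;; the radius and sigmas here are illustrative):&lt;/p&gt;

```python
import numpy as np

def bilateral_filter(img: np.ndarray, radius: int = 2,
                     sigma_space: float = 2.0,
                     sigma_range: float = 0.1) -> np.ndarray:
    """Naive bilateral filter for a 2-D float image in [0, 1].

    Averages each pixel with its neighbours, weighting by both spatial
    distance and intensity difference, which smooths flat regions while
    (partially) preserving edges.
    """
    h, w = img.shape
    padded = np.pad(img, radius, mode="reflect")
    out = np.empty_like(img)
    # Precompute the spatial Gaussian weights for the window
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_space ** 2))
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-((window - img[i, j]) ** 2) / (2 * sigma_range ** 2))
            weights = spatial * rng
            out[i, j] = (weights * window).sum() / weights.sum()
    return out
```

&lt;p&gt;The range term is what makes it edge-aware – but with one global &lt;code&gt;sigma_range&lt;/code&gt;, complex SEM noise forces exactly the over-smoothing vs. under-denoising compromise described above.&lt;/p&gt;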

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For high-throughput production lines where quality is critical and infrastructure exists → deep learning wins.&lt;br&gt;&lt;br&gt;
For prototyping or lower-volume scenarios → classical methods remain practical.&lt;/p&gt;




&lt;h2&gt;
  
  
  Engineering it for production
&lt;/h2&gt;

&lt;p&gt;A good model is not enough if it only lives in a notebook.&lt;br&gt;&lt;br&gt;
From the start, the project was built with production in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modularity&lt;/strong&gt; – clear separation between data loaders, model definitions, metrics, and experiment code.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config-driven experiments&lt;/strong&gt; – every run is defined by a config file, which makes experiments reproducible and makes it easy to tweak hyper-parameters and loss components.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation framework&lt;/strong&gt; – a Python-based evaluation pipeline using PSNR, SSIM, and several custom metrics, with automated reporting and visual side-by-side comparisons for all methods (baseline, classical denoisers, and U-Net variants).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend &amp;amp; storage&lt;/strong&gt; – a FastAPI backend, containerized with Docker, using PostgreSQL for metadata/metrics and MinIO as an object store for all noisy/clean/denoised images.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop client&lt;/strong&gt; – a PyQt desktop application that talks to the API and lets users interactively compare methods (classical vs. deep learning) both visually and via metrics, image by image or over entire runs.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03syup5k755154lo12xz.png" alt="PyQt SEM denoising client: noisy vs denoised images with metrics and run list"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, the desktop client lets engineers scroll through wafers, switch between classical and deep-learning models, and inspect both the images and their metrics side by side.&lt;/p&gt;

&lt;p&gt;This turned the project from “just a research notebook” into a reproducible, debuggable system that other engineers can run, extend, and plug into existing SEM analysis workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;This project taught me that in deep learning, &lt;strong&gt;problem definition is just as important as model design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Shifting the focus from “which architecture should I use?” to&lt;br&gt;&lt;br&gt;
“&lt;strong&gt;what exactly do I want the model to optimize?&lt;/strong&gt;” – via careful loss engineering –&lt;br&gt;&lt;br&gt;
is what allowed us to reach strong performance while preserving the tiny, critical defects that other methods tended to erase.&lt;/p&gt;

&lt;p&gt;In the end, a powerful model is not just about using the latest tools.&lt;br&gt;&lt;br&gt;
It’s about combining them with a &lt;strong&gt;deep understanding of the data and the domain&lt;/strong&gt; – and turning that understanding into the right objective for the model to learn.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
      <category>python</category>
    </item>
  </channel>
</rss>
