<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kirill Sokol</title>
    <description>The latest articles on Forem by Kirill Sokol (@malkiel).</description>
    <link>https://forem.com/malkiel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838782%2F09aa9a67-8e84-4ac6-b79e-a4de988c003e.jpg</url>
      <title>Forem: Kirill Sokol</title>
      <link>https://forem.com/malkiel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/malkiel"/>
    <language>en</language>
    <item>
      <title>We Trained a Skin Analysis AI Model on Millions of Real Photos — What Actually Works in Production</title>
      <dc:creator>Kirill Sokol</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:31:09 +0000</pubDate>
      <link>https://forem.com/malkiel/we-trained-a-skin-analysis-ai-model-on-millions-of-real-photos-what-actually-works-in-production-4a80</link>
      <guid>https://forem.com/malkiel/we-trained-a-skin-analysis-ai-model-on-millions-of-real-photos-what-actually-works-in-production-4a80</guid>
      <description>&lt;p&gt;Over the past few years, we’ve been building a mobile-first &lt;a href="https://skinive.com" rel="noopener noreferrer"&gt;AI skin analysis&lt;/a&gt; system used by more than 1,000,000 users worldwide (except the USA and Canada). Unlike most research setups, it operates on real-world smartphone images: not clinical data, but noisy, user-generated photos taken in uncontrolled conditions.&lt;/p&gt;

&lt;p&gt;To date, we’ve processed millions of images, with a curated subset of a few hundred thousand used for training. A fixed validation set of ~27,000 real-world images has been used to track performance consistently across model versions.&lt;/p&gt;

&lt;p&gt;This article isn’t about building a model from scratch. It’s about what actually works when you try to improve one in production — over years, not weeks.&lt;/p&gt;




&lt;h2&gt;1. A Fixed Validation Set Is More Valuable Than a Bigger One&lt;/h2&gt;

&lt;p&gt;One of the most important decisions we made was also one of the least exciting.&lt;/p&gt;

&lt;p&gt;We stopped updating our validation dataset.&lt;/p&gt;

&lt;p&gt;Every model version was evaluated on the same ~27k real-world images. No rebalancing, no cleaning, no improvements.&lt;/p&gt;

&lt;p&gt;This made progress slower — but more honest.&lt;/p&gt;

&lt;p&gt;When metrics improved, we knew it wasn’t because the test data got easier. It was because the model actually got better.&lt;/p&gt;
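&lt;p&gt;The idea fits in a few lines. Everything below is an illustrative toy (integer “images”, lambda “models”), not our actual evaluation code:&lt;/p&gt;

```python
# Sketch: evaluate every model version against the same frozen validation set.
# Names and data here are toy stand-ins; the real pipeline is not public.

def evaluate(model, frozen_set):
    """Accuracy of `model` on the frozen (never-updated) validation set."""
    correct = sum(1 for image, label in frozen_set if model(image) == label)
    return correct / len(frozen_set)

# Fixed once, then never cleaned, rebalanced, or "improved".
frozen_set = [(x, x % 2) for x in range(100)]

model_v1 = lambda x: 0        # always predicts class 0
model_v2 = lambda x: x % 2    # perfect on this toy task

acc_v1 = evaluate(model_v1, frozen_set)   # 0.5
acc_v2 = evaluate(model_v2, frozen_set)   # 1.0
```

&lt;p&gt;Because the denominator never changes, the delta between acc_v1 and acc_v2 is attributable to the model alone, not to drift in the test data.&lt;/p&gt;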




&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08x9i5tysflwrjeybjeq.png" alt="Skinive AI Accuracy metrics" width="800" height="448"&gt;&lt;/p&gt;

&lt;h2&gt;2. More Data Stops Helping Faster Than You Think&lt;/h2&gt;

&lt;p&gt;We assumed that scaling data would continuously improve performance.&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;Once we reached millions of images, the marginal gain from additional data dropped significantly.&lt;/p&gt;

&lt;p&gt;The real improvement came from filtering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing low-quality images&lt;/li&gt;
&lt;li&gt;reducing redundancy&lt;/li&gt;
&lt;li&gt;increasing representation of rare but important cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, a curated subset of a few hundred thousand images was more useful than the full dataset.&lt;/p&gt;
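&lt;p&gt;A rough sketch of that curation logic (field names, thresholds, and the rare-class set are all hypothetical):&lt;/p&gt;

```python
# Sketch: keep high-quality, non-duplicate samples, but never drop rare classes.
# Quality scores and hashes are illustrative placeholders.

def curate(samples, min_quality=0.5, rare_classes=()):
    seen_hashes = set()
    kept = []
    for s in samples:
        if s["quality"] < min_quality and s["label"] not in rare_classes:
            continue                     # drop low-quality images of common classes
        if s["hash"] in seen_hashes:
            continue                     # drop duplicates / near-duplicates
        seen_hashes.add(s["hash"])
        kept.append(s)
    return kept

samples = [
    {"hash": "a", "quality": 0.9, "label": "acne"},
    {"hash": "a", "quality": 0.9, "label": "acne"},      # duplicate -> dropped
    {"hash": "b", "quality": 0.2, "label": "acne"},      # low quality, common -> dropped
    {"hash": "c", "quality": 0.3, "label": "melanoma"},  # low quality but rare -> kept
]
curated = curate(samples, rare_classes={"melanoma"})
```

&lt;p&gt;The asymmetry is the point: quality filters apply aggressively to common classes, while rare but important cases survive even at lower quality.&lt;/p&gt;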




&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgb6zj9c30kn9vd8ezj1.png" alt="Skin conditions dataset" width="800" height="451"&gt;&lt;/p&gt;

&lt;h2&gt;3. Garbage In, Garbage Out — So We Filter Before Inference&lt;/h2&gt;

&lt;p&gt;One thing we underestimated early on was how much of the input wouldn’t even be valid.&lt;br&gt;
Not just low-quality images — completely irrelevant ones.&lt;br&gt;
Users upload everything: blurred frames, partial shots, or images that don’t contain any useful signal at all. In practice, around 30–40% of raw user uploads had to be filtered out before reaching the model.&lt;/p&gt;

&lt;p&gt;Instead of trying to make the model robust to everything, we introduced a preprocessing pipeline.&lt;/p&gt;

&lt;p&gt;On-device, we run a lightweight object detector (initially YOLO-based, later replaced with a more optimized version) to localize regions of interest and automatically crop the relevant area. This helps standardize inputs without requiring perfect user behavior.&lt;br&gt;
On the backend, we apply an additional relevance check. If an image doesn’t appear to contain skin, we don’t process it further and instead prompt the user to retake the photo.&lt;/p&gt;

&lt;p&gt;For borderline cases, we attempt basic enhancement — denoising, sharpening, contrast adjustments. If the image becomes usable, it proceeds through the pipeline. If not, it is discarded.&lt;br&gt;
This step alone significantly improved the overall system reliability — not by changing the model, but by improving the input.&lt;/p&gt;
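&lt;p&gt;The whole pipeline boils down to a gate in front of the model. The toy sketch below uses strings in place of images; detect_roi, looks_like_skin, and enhance are hypothetical stand-ins for the on-device detector, the backend relevance check, and the enhancement step:&lt;/p&gt;

```python
# Sketch of the pre-inference filtering pipeline. All functions are toy
# stand-ins operating on strings instead of image tensors.

def detect_roi(image):
    # On-device detector: localize and crop the region of interest.
    return image.replace("bg|", "")

def looks_like_skin(image):
    # Backend relevance check: does the image plausibly contain skin?
    return image.startswith("skin")

def enhance(image):
    # Borderline cases: denoise / sharpen / adjust contrast, then re-check.
    return image.replace("noisy_", "skin_")

def process(image):
    image = detect_roi(image)
    if not looks_like_skin(image):
        image = enhance(image)
        if not looks_like_skin(image):
            return None          # reject: prompt the user to retake the photo
    return image                 # forward to the classification model

results = [process(x) for x in ["bg|skin_arm", "noisy_arm", "cat_photo"]]
```

&lt;p&gt;Note that the model itself never changes here; reliability improves purely because invalid inputs never reach it.&lt;/p&gt;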

&lt;h2&gt;4. Real-World Images Break Simplified Assumptions&lt;/h2&gt;

&lt;p&gt;Most models are trained on clean, well-centered images.&lt;/p&gt;

&lt;p&gt;Real users don’t behave like datasets.&lt;/p&gt;

&lt;p&gt;Photos can include multiple objects, poor framing, inconsistent lighting, or irrelevant content. Treating the entire image as a single input often leads to unstable behavior.&lt;/p&gt;

&lt;p&gt;Moving toward detection-based approaches — where the model focuses on specific regions — significantly improved real-world performance.&lt;/p&gt;

&lt;p&gt;Not because it improved benchmarks immediately, but because it aligned the system with reality.&lt;/p&gt;
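&lt;p&gt;In pseudocode terms, the shift is from classifying the whole frame to classifying detected regions and reporting the most significant finding. detect_regions and classify below are toy stand-ins, not our production models:&lt;/p&gt;

```python
# Sketch: detection-based scoring. Instead of one prediction over the whole
# frame (background included), classify each detected region separately.

def detect_regions(image):
    # A real detector returns bounding-box crops; here, split on ';'.
    return [r for r in image.split(";") if r]

def classify(region):
    # Toy classifier: regions containing "lesion" score high.
    return 0.9 if "lesion" in region else 0.1

def score_image(image):
    # Report the most significant finding across regions, so background
    # and irrelevant content cannot dilute the prediction.
    regions = detect_regions(image)
    return max(classify(r) for r in regions) if regions else 0.0

score = score_image("wall;hand;lesion_crop")
```

&lt;p&gt;Averaged over the whole frame, the lesion signal would be diluted by the wall and the hand; region-based scoring keeps it intact.&lt;/p&gt;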




&lt;h2&gt;5. Optimizing One Metric Can Hurt the Product&lt;/h2&gt;

&lt;p&gt;Early versions of the model prioritized sensitivity.&lt;/p&gt;

&lt;p&gt;This reduced missed cases — but increased false positives.&lt;/p&gt;

&lt;p&gt;From a metrics perspective, this looked like progress.&lt;/p&gt;

&lt;p&gt;From a product perspective, it created friction.&lt;/p&gt;

&lt;p&gt;Over time, improving precision became just as important. The goal shifted from “detect everything” to “provide useful and trustworthy outputs.”&lt;/p&gt;

&lt;p&gt;The key lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Model quality is not defined by a single metric — but by how metrics interact.&lt;/p&gt;
&lt;/blockquote&gt;
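&lt;p&gt;One common way to operationalize this trade-off is to choose the decision threshold that balances sensitivity and precision, for example via F1. Our actual production criterion is more involved; this is only a sketch with made-up scores:&lt;/p&gt;

```python
# Sketch: pick an operating threshold by balancing sensitivity and precision
# (F1 here), instead of maximizing sensitivity alone.

def metrics(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

def best_threshold(scores, labels, candidates):
    def f1(t):
        sens, prec = metrics(scores, labels, t)
        return 2 * sens * prec / (sens + prec) if sens + prec else 0.0
    return max(candidates, key=f1)

scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.95]   # toy model confidences
labels = [0,   0,   1,    1,   1,   1]      # toy ground truth
t = best_threshold(scores, labels, [0.3, 0.5, 0.7])
```

&lt;p&gt;Lowering the threshold raises sensitivity but admits more false positives; the selection step makes that interaction explicit instead of optimizing one side blindly.&lt;/p&gt;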




&lt;h2&gt;6. Better Models Don’t Always Win&lt;/h2&gt;

&lt;p&gt;We experimented with multiple architectures over time.&lt;/p&gt;

&lt;p&gt;Some were more advanced. Some performed better in controlled settings.&lt;/p&gt;

&lt;p&gt;But the biggest gains didn’t come from model upgrades.&lt;/p&gt;

&lt;p&gt;They came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better data selection&lt;/li&gt;
&lt;li&gt;more consistent labeling&lt;/li&gt;
&lt;li&gt;stable evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In several cases, a simpler model trained on better data outperformed a more complex one trained on everything.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Diversity of real-world data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff43ujrd58jmrapupyacb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff43ujrd58jmrapupyacb.png" alt="Skinive app users: world distribution" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uy9r6mck78guf0qzrto.png" alt="Skinive app users: Skin colour distribution" width="800" height="455"&gt;&lt;/p&gt;

&lt;h2&gt;What Actually Makes a Production Model Better&lt;/h2&gt;

&lt;p&gt;Looking back, the improvements came from a combination of decisions that are often overlooked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keeping evaluation consistent&lt;/li&gt;
&lt;li&gt;focusing on data quality instead of volume&lt;/li&gt;
&lt;li&gt;aligning the model with real-world inputs&lt;/li&gt;
&lt;li&gt;balancing metrics instead of maximizing one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are particularly novel.&lt;/p&gt;

&lt;p&gt;But together, they made the system significantly more reliable.&lt;/p&gt;




&lt;h2&gt;Final Thought&lt;/h2&gt;

&lt;p&gt;If you’re building models on real-world user data — especially from mobile devices — your biggest challenge isn’t training.&lt;/p&gt;

&lt;p&gt;It’s making sure your improvements are real.&lt;/p&gt;




&lt;h2&gt;One Open Question&lt;/h2&gt;

&lt;p&gt;One thing we’re still actively thinking about is where the optimal balance actually lies.&lt;/p&gt;

&lt;p&gt;Should a system prioritize detecting as much as possible?&lt;/p&gt;

&lt;p&gt;Or should it prioritize being trusted by the user?&lt;/p&gt;

&lt;p&gt;In our experience, those two goals are not always aligned.&lt;/p&gt;




&lt;h2&gt;Full Breakdown &amp;amp; Demo&lt;/h2&gt;

&lt;p&gt;We published a full breakdown of the dataset, validation setup, and model evolution (with charts and metrics) here:&lt;br&gt;
👉 &lt;a href="https://skinive.com/skinive-accuracy2026/" rel="noopener noreferrer"&gt;https://skinive.com/skinive-accuracy2026/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to see how this works in practice in a real skin analysis app, there’s also a demo available here:&lt;br&gt;
👉 &lt;a href="https://skinive.com/get-skinive/" rel="noopener noreferrer"&gt;https://skinive.com/get-skinive/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computervision</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
