<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Andrea</title>
    <description>The latest articles on Forem by Andrea (@lattanzi).</description>
    <link>https://forem.com/lattanzi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817575%2F5eb67dce-de24-432b-a21c-0fa8769ea030.png</url>
      <title>Forem: Andrea</title>
      <link>https://forem.com/lattanzi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lattanzi"/>
    <language>en</language>
    <item>
      <title>How I automated text dataset cleaning for ML training with AI (and why regex wasn't enough)</title>
      <dc:creator>Andrea</dc:creator>
      <pubDate>Tue, 10 Mar 2026 23:22:33 +0000</pubDate>
      <link>https://forem.com/lattanzi/how-i-automated-text-dataset-cleaning-for-ml-training-with-ai-and-why-regex-wasnt-enough-3h19</link>
      <guid>https://forem.com/lattanzi/how-i-automated-text-dataset-cleaning-for-ml-training-with-ai-and-why-regex-wasnt-enough-3h19</guid>
      <description>&lt;h2&gt;The problem nobody talks about&lt;/h2&gt;

&lt;p&gt;Every ML engineer knows the principle: garbage in, garbage out. But somehow, most teams still spend weeks manually cleaning text data before training — or worse, they skip the cleaning and wonder why their model underperforms.&lt;/p&gt;

&lt;p&gt;I've been working with text datasets for years, and the pattern is always the same. You get data from a CRM, an ERP, scanned documents, web scraping, automated feeds. And the data looks &lt;em&gt;mostly&lt;/em&gt; fine. Until you look closely.&lt;/p&gt;

&lt;p&gt;Double spaces everywhere. Punctuation that's technically valid Unicode but renders wrong. Words repeated in sequence ("the the company"). Apostrophes that are sometimes &lt;code&gt;'&lt;/code&gt;, sometimes &lt;code&gt;’&lt;/code&gt;, sometimes &lt;code&gt;´&lt;/code&gt;. Capitalization that changes mid-sentence. Encoding artifacts from a database migration five years ago.&lt;/p&gt;

&lt;p&gt;Each error seems harmless. Multiply them by a million records, and your model is learning noise as signal.&lt;/p&gt;

&lt;h2&gt;"Just write a regex"&lt;/h2&gt;

&lt;p&gt;Sure, for some things. Lowercase everything? &lt;code&gt;text.lower()&lt;/code&gt;. Strip HTML tags? Easy regex. Remove double spaces? &lt;code&gt;re.sub(r' +', ' ', text)&lt;/code&gt;.&lt;/p&gt;
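
&lt;p&gt;As a minimal sketch, that deterministic layer is a few lines of plain Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def basic_clean(text):
    # The rule-based layer: cheap, deterministic, no AI needed
    text = re.sub(r'&amp;lt;[^&amp;gt;]*&amp;gt;', '', text)  # strip HTML tags (naive)
    text = re.sub(r' +', ' ', text)       # collapse repeated spaces
    return text.strip().lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;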

&lt;p&gt;But what about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR artifacts that vary from page to page ("rn" → "m" sometimes, not always)&lt;/li&gt;
&lt;li&gt;Free-text customer notes where every record is different&lt;/li&gt;
&lt;li&gt;Typos that aren't in any standard dictionary&lt;/li&gt;
&lt;li&gt;Mixed-language text where the rules change mid-field&lt;/li&gt;
&lt;li&gt;Encoding errors interleaved with valid special characters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These require &lt;em&gt;judgment&lt;/em&gt;. A human can spot them instantly but can't process 100K records. A regex can process 100K records but can't make judgment calls.&lt;/p&gt;

&lt;p&gt;This is exactly where LLMs excel.&lt;/p&gt;
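
&lt;p&gt;To make that concrete, here's a minimal sketch of a single judgment-based cleaning call. This is the general idea, not PurifyFactory's internals; it uses the official &lt;code&gt;openai&lt;/code&gt; Python client, and the prompt is my own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RULES = (
    "Clean the user's text: remove duplicate words, fix punctuation, "
    "normalize apostrophes, correct obvious OCR errors. "
    "Return only the cleaned text, nothing else."
)

def llm_clean(text):
    # One record in, one judgment call out
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # we want cleaning, not creativity
        messages=[
            {"role": "system", "content": RULES},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;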

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;PurifyFactory is a CLI pipeline that uses AI language models to clean text datasets at scale. The workflow is deliberately simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Split your JSONL dataset into optimal batches&lt;/span&gt;
./purifyfactory &lt;span class="nb"&gt;split&lt;/span&gt; &lt;span class="nt"&gt;--input&lt;/span&gt; messy_data.jsonl &lt;span class="nt"&gt;--config&lt;/span&gt; my_config.json

&lt;span class="c"&gt;# 2. Queue the work&lt;/span&gt;
./purifyfactory orchestrate &lt;span class="nt"&gt;--config&lt;/span&gt; my_config.json

&lt;span class="c"&gt;# 3. Process with AI (parallel workers, auto-recovery)&lt;/span&gt;
./purifyfactory process &lt;span class="nt"&gt;--config&lt;/span&gt; my_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a JSONL file with original and cleaned text side by side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The  company's  product was  very very  popular"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cleaned_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The company's product was very popular"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.000010&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
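
&lt;p&gt;Because the output is plain JSONL, auditing it downstream is trivial. A quick sketch, using the field names from the record above (the file path is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

with open("cleaned.jsonl") as f:  # hypothetical output path
    records = [json.loads(line) for line in f]

total_cost = sum(r["cost"] for r in records)
changed = sum(1 for r in records if r["cleaned_text"] != r["original_text"])
print(f"{changed}/{len(records)} records changed, total cost ${total_cost:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;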



&lt;h2&gt;How it actually works&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You define the rules.&lt;/strong&gt; Your cleaning logic lives in the system prompt — natural language instructions that the AI applies consistently to every record. "Remove duplicate words. Fix punctuation. Normalize apostrophes to standard Unicode. Correct obvious OCR errors."&lt;/p&gt;
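
&lt;p&gt;For illustration, the config for that kind of prompt could look something like this. To be clear, this schema is hypothetical (the field names are mine, not the tool's documented format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "_note": "hypothetical schema, for illustration only",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "system_prompt": "Remove duplicate words. Fix punctuation. Normalize apostrophes to standard Unicode. Correct obvious OCR errors. Return only the cleaned text.",
  "workers": 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;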

&lt;p&gt;The key insight: what takes a human 10 seconds per record and is impossible to scale, takes the LLM milliseconds and costs fractions of a cent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Split&lt;/strong&gt;: Your dataset gets chunked into optimal batch sizes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrate&lt;/strong&gt;: Batches are queued for parallel processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt;: Multiple workers process batches in parallel. Failed batches are recovered automatically when the supervisor daemon is running, or can be re-queued manually with a single command (the general pattern is sketched after this list)&lt;/li&gt;
&lt;/ul&gt;
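
&lt;p&gt;As a rough sketch of that last step (the general pattern, not the tool's actual source), parallel batch processing with retry looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor, as_completed

def clean_batch(batch):
    # Stand-in for the real per-record LLM call sketched earlier
    return [" ".join(record.split()) for record in batch]

def process_with_retry(batch, attempts=3):
    # Retry a failed batch a few times; anything that still fails
    # can be re-queued by a supervisor process
    for attempt in range(attempts):
        try:
            return clean_batch(batch)
        except Exception:
            if attempt == attempts - 1:
                raise

batches = [["the  the company", "very very  popular"], ["rn essy text"]]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_with_retry, b) for b in batches]
    results = [f.result() for f in as_completed(futures)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;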

&lt;p&gt;&lt;strong&gt;Multi-provider&lt;/strong&gt;: Works with OpenAI, Anthropic Claude, Google Gemini, or local models via Ollama/vLLM. Switch providers by changing one line in the config. Automatic fallback if a provider fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-premise&lt;/strong&gt;: The binary runs entirely on your machine. Your data never touches any server except the API calls to your chosen provider. Essential for sensitive corporate datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost tracking&lt;/strong&gt;: Every record in the output includes its token count and cost. The &lt;code&gt;report&lt;/code&gt; command gives you total cost, average cost per record, and processing time; reference estimates are in the table below.&lt;/p&gt;

&lt;h2&gt;Estimated costs&lt;/h2&gt;

&lt;p&gt;Costs vary with average text length. Reference estimates based on provider pricing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset size&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Estimated cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 records&lt;/td&gt;
&lt;td&gt;gpt-4o-mini&lt;/td&gt;
&lt;td&gt;~$0.05–0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 records&lt;/td&gt;
&lt;td&gt;gpt-4o-mini&lt;/td&gt;
&lt;td&gt;~$0.50–1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 records&lt;/td&gt;
&lt;td&gt;claude-haiku-4-5&lt;/td&gt;
&lt;td&gt;~$0.40–1.20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
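
&lt;p&gt;The arithmetic behind those figures is simple. Assuming gpt-4o-mini's published rates at the time of writing ($0.15 per 1M input tokens, $0.60 per 1M output tokens) and, as a rough guess, 150 tokens in and 150 out per record:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope estimate; pricing and token counts are assumptions
records = 10_000
tokens_in = tokens_out = 150        # rough per-record average
price_in, price_out = 0.15, 0.60    # USD per 1M tokens (gpt-4o-mini)

cost = records * (tokens_in * price_in + tokens_out * price_out) / 1_000_000
print(f"${cost:.2f}")  # about $1.12, inside the ~$0.50-1.50 range above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;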

&lt;p&gt;The cost-to-quality tradeoff is what makes this worthwhile. In my experience, a model trained on 5K ultra-clean records can outperform one trained on 10K messy ones, so a few dollars in API costs can save days of post-training debugging.&lt;/p&gt;

&lt;h2&gt;Open beta&lt;/h2&gt;

&lt;p&gt;PurifyFactory is now in open beta (v9.1.6). Currently Linux x86_64 only — Windows and macOS are on the roadmap.&lt;/p&gt;

&lt;p&gt;If you work with text datasets and want to try it, you can apply for the beta program here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://purifyfactory.com" rel="noopener noreferrer"&gt;https://purifyfactory.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux x86_64&lt;/li&gt;
&lt;li&gt;An API key from OpenAI, Anthropic, Google Gemini, or a local model setup&lt;/li&gt;
&lt;li&gt;A dataset you'd like to clean (1,000+ records recommended)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Access is free, there's a direct feedback channel with the dev team, and your input shapes the final product.&lt;/p&gt;

&lt;p&gt;Built by &lt;a href="https://mentoratechnologies.com" rel="noopener noreferrer"&gt;Mentora Technologies&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with text data quality in ML pipelines? I'd love to hear how others are handling this — especially at scale.&lt;/em&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
