<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Marfo Appiah</title>
    <description>The latest articles on Forem by Alex Marfo Appiah (@theboylexis).</description>
    <link>https://forem.com/theboylexis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3823588%2F4cd1e4f2-cccc-4312-8be9-60d933bae896.jpeg</url>
      <title>Forem: Alex Marfo Appiah</title>
      <link>https://forem.com/theboylexis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/theboylexis"/>
    <language>en</language>
    <item>
      <title>How I parse 15+ MoMo SMS formats with 95% accuracy</title>
      <dc:creator>Alex Marfo Appiah</dc:creator>
      <pubDate>Sat, 14 Mar 2026 08:19:51 +0000</pubDate>
      <link>https://forem.com/theboylexis/-how-i-parse-15-momo-sms-formats-with-95-accuracy-47hg</link>
      <guid>https://forem.com/theboylexis/-how-i-parse-15-momo-sms-formats-with-95-accuracy-47hg</guid>
      <description>&lt;p&gt;Mobile Money is how West Africa moves money. In Ghana alone, MTN MoMo, Telecel Cash, and AirtelTigo Money process millions of transactions daily — all confirmed via SMS. If you want to build financial tools for these users, that SMS inbox is your data source.&lt;/p&gt;

&lt;p&gt;The problem: those messages are unstructured text. Every telco writes them differently. Every transaction type has its own format. There's no API. You get a string and you're on your own.&lt;/p&gt;

&lt;p&gt;This is how I built a parser that handles 15+ distinct SMS formats reliably.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem, concretely
&lt;/h2&gt;

&lt;p&gt;Here are three real SMS messages from three different telcos, all representing the same transaction type (money sent):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MTN:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment made for GHS 50.00 to DAVID BOATENG. Current Balance: GHS 450.00. Reference: school fees. Transaction ID: 76712833868. Fee charged: GHS 0.50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Telecel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0000015061132227 Confirmed. You have sent GHS50.00 to 0241234567 - JOHN MENSAH on 2025-09-10 at 13:51:07. Your Telecel Cash balance is GHS654.03.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AirtelTigo:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TXN 987654321. GHS 30.00 sent to KWAME ASANTE (0271234567). Balance: GHS 120.50. Thank you for using AirtelTigo Money.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same event. Completely different structure. Different field names, different ordering, different date formats, different ways of expressing the counterparty.&lt;/p&gt;

&lt;p&gt;A naive approach — split on spaces, look for "GHS" — fails immediately on edge cases like fee extraction, missing fields, or transaction IDs that look like phone numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture: 3-stage pipeline
&lt;/h2&gt;

&lt;p&gt;I settled on a regex template system with three stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Telco detection
&lt;/h3&gt;

&lt;p&gt;Before matching templates, I need to know which telco sent the message. I use a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keyword signals&lt;/strong&gt; — "Telecel Cash", "MobileMoney", "AirtelTigo Money" appear in most messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sender ID&lt;/strong&gt; — if the app passes the sender ID (e.g. "MobileMoney", "TelecelCash"), that's a strong signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structural patterns&lt;/strong&gt; — MTN always starts with "Payment made for" or "Payment received for"; Telecel messages start with a 16-digit transaction ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telco detection returns a confidence score. If that score falls below a threshold, I fall back to trying every template and taking the highest-scoring match.&lt;/p&gt;
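
&lt;p&gt;Here is a minimal sketch of how those three signals could be combined into one score. The signal tables, the weights, and the &lt;code&gt;detect_telco&lt;/code&gt; name are assumptions made for illustration, not the parser's real values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Hypothetical signal tables; the weights below are illustrative assumptions.
KEYWORDS = {
    "mtn": ["MobileMoney"],
    "telecel": ["Telecel Cash"],
    "airteltigo": ["AirtelTigo Money"],
}
SENDER_IDS = {"MobileMoney": "mtn", "TelecelCash": "telecel"}
STRUCTURE = {
    "mtn": re.compile(r"^Payment (made|received) for"),
    "telecel": re.compile(r"^\d{16}\b"),
}

def detect_telco(text, sender_id=None):
    """Return (telco, confidence); telco is None when no signal fires."""
    scores = {}
    lowered = text.lower()
    # Sender ID is treated as the strongest signal.
    if sender_id in SENDER_IDS:
        telco = SENDER_IDS[sender_id]
        scores[telco] = scores.get(telco, 0.0) + 0.6
    # Keyword hits anywhere in the message body.
    for telco, words in KEYWORDS.items():
        if any(word.lower() in lowered for word in words):
            scores[telco] = scores.get(telco, 0.0) + 0.3
    # Structural patterns anchored at the start of the message.
    for telco, pattern in STRUCTURE.items():
        if pattern.search(text):
            scores[telco] = scores.get(telco, 0.0) + 0.3
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, min(scores[best], 1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A caller would compare the returned confidence with its own threshold and, when it is too low, run every template instead of only the detected telco's.&lt;/p&gt;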

&lt;h3&gt;
  
  
  Stage 2: Template matching
&lt;/h3&gt;

&lt;p&gt;Each (telco, transaction_type) pair has one or more regex templates. For example, &lt;code&gt;(mtn, transfer_sent)&lt;/code&gt; has this template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment made for GHS (?P&amp;lt;amount&amp;gt;[\d,]+\.?\d*) to (?P&amp;lt;counterparty_name&amp;gt;[A-Z ]+)\.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Current Balance: GHS (?P&amp;lt;balance&amp;gt;[\d,]+\.?\d*)\.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Reference: (?P&amp;lt;reference&amp;gt;[^.]+)\.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Transaction ID: (?P&amp;lt;tx_id&amp;gt;\d+)\.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Fee charged: GHS (?P&amp;lt;fee&amp;gt;[\d,]+\.?\d*)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Named capture groups do the field extraction. Every significant field — amount, balance, fee, counterparty name, counterparty phone, date, time, transaction ID, reference — gets its own named group.&lt;/p&gt;

&lt;p&gt;I score each template by the number of named groups that matched. A template that captures 8/9 fields scores higher than one that captures 5/9. The best-scoring template wins.&lt;/p&gt;
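
&lt;p&gt;The scoring rule can be sketched in a few lines. This assumes each candidate template has already been matched and its captures collected with &lt;code&gt;match.groupdict()&lt;/code&gt;, where groups that did not participate come back as &lt;code&gt;None&lt;/code&gt;. The function names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def score_captures(captures):
    """Count the fields a template actually captured (non-None values)."""
    return sum(1 for value in captures.values() if value is not None)

def pick_best(candidates):
    """candidates: list of (template_name, captures) pairs.

    Returns the pair with the most captured fields, or None if the list is empty.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda pair: score_captures(pair[1]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
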

&lt;h3&gt;
  
  
  Stage 3: Field normalization
&lt;/h3&gt;

&lt;p&gt;Raw captures need cleaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amounts&lt;/strong&gt;: strip commas, cast to float (&lt;code&gt;"1,299.50"&lt;/code&gt; → &lt;code&gt;1299.50&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dates&lt;/strong&gt;: normalize to ISO 8601 — Telecel uses &lt;code&gt;YYYY-MM-DD&lt;/code&gt;, some AirtelTigo messages use &lt;code&gt;DD/MM/YYYY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
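&lt;strong&gt;Phone numbers&lt;/strong&gt;: strip leading zeros and country-code prefixes, so that &lt;code&gt;0241234567&lt;/code&gt; and &lt;code&gt;+233241234567&lt;/code&gt; normalize to the same value&lt;/li&gt;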
&lt;li&gt;
&lt;strong&gt;Names&lt;/strong&gt;: strip trailing punctuation left by greedy matches&lt;/li&gt;
&lt;/ul&gt;
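
&lt;p&gt;Roughly, the cleaning rules above could look like this. The helper names are hypothetical, and the choice of the nine national digits as the canonical phone form is my assumption for the sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

def normalize_amount(raw):
    """Strip thousands separators and cast to float."""
    return float(raw.replace(",", ""))

def normalize_date(raw):
    """Accept YYYY-MM-DD or DD/MM/YYYY and return an ISO 8601 date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: " + raw)

def normalize_phone(raw):
    """Drop the 233 country code and the leading zero (assumed canonical form)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 12 and digits.startswith("233"):
        digits = digits[3:]
    return digits.lstrip("0")

def normalize_name(raw):
    """Trim whitespace and trailing punctuation left by a greedy match."""
    return raw.strip().rstrip(".,;:- ")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
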

&lt;p&gt;The result is a &lt;code&gt;ParseResult&lt;/code&gt; object with typed fields and a confidence score between 0 and 1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it gets hard: the edge cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overlapping patterns
&lt;/h3&gt;

&lt;p&gt;A 16-digit number at the start of a Telecel message is a transaction ID. An 11-digit number mid-message is probably a phone number. A 10-digit number could be either. I use positional context in the regex — anchors and lookaheads — to disambiguate.&lt;/p&gt;
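
&lt;p&gt;As an illustration of positional context, here is a small sketch built around the Telecel layout shown earlier. The two patterns and the &lt;code&gt;classify_numbers&lt;/code&gt; helper are assumptions for the example, not the templates the parser ships with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Anchor: a 16-digit run at the very start of the message is read as the ID.
LEADING_TX_ID = re.compile(r"^(\d{16})\b")
# Lookahead: a 10-digit run counts as a phone only when " - NAME" follows it.
PHONE_BEFORE_NAME = re.compile(r"\b(0\d{9})\b(?= - [A-Z])")

def classify_numbers(text):
    """Return a dict with whichever of tx_id / counterparty_phone were found."""
    found = {}
    tx = LEADING_TX_ID.search(text)
    if tx:
        found["tx_id"] = tx.group(1)
    phone = PHONE_BEFORE_NAME.search(text)
    if phone:
        found["counterparty_phone"] = phone.group(1)
    return found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The word boundaries stop the phone pattern from matching ten digits inside the longer transaction ID.&lt;/p&gt;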

&lt;h3&gt;
  
  
  Greedy counterparty names
&lt;/h3&gt;

&lt;p&gt;Names like "ACCRA TRADERS" or "JOHN MENSAH BOATENG" are tricky because they can vary in word count. I use &lt;code&gt;[A-Z][A-Z ]+&lt;/code&gt; with a lookahead for the next known delimiter (usually " on ", " - ", or a period). Occasionally a name contains a hyphen (&lt;code&gt;KWAME ASANTE-MENSAH&lt;/code&gt;) which breaks naive &lt;code&gt;[A-Z ]+&lt;/code&gt; patterns. I handle this with a secondary pattern that allows hyphens after at least two words.&lt;/p&gt;
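
&lt;p&gt;A sketch of the two-pattern approach. The &lt;code&gt;to&lt;/code&gt; / &lt;code&gt;-&lt;/code&gt; prefix, the exact delimiters, and the &lt;code&gt;extract_name&lt;/code&gt; helper are assumptions chosen to fit the MTN and Telecel samples above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Lookahead for the next known delimiter: " on ", " - ", or a period.
DELIM = r"(?= on | - |\.)"
# Primary pattern: uppercase words separated by spaces.
PRIMARY = re.compile(r"(?:\bto|-) ([A-Z][A-Z ]+)" + DELIM)
# Secondary pattern: hyphens are allowed once two words have been seen.
HYPHENATED = re.compile(r"(?:\bto|-) ([A-Z]+ [A-Z]+(?:[ -][A-Z]+)*)" + DELIM)

def extract_name(text):
    """Try the primary pattern first, then the hyphen-tolerant one."""
    for pattern in (PRIMARY, HYPHENATED):
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the lookahead does not consume the delimiter, the captured name never includes the trailing " on " or period.&lt;/p&gt;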

&lt;h3&gt;
  
  
  Missing fields
&lt;/h3&gt;

&lt;p&gt;Wallet-to-wallet transfers often don't include a fee. Balance queries don't have a counterparty. Airtime purchases never have a reference. Templates must be tolerant of missing fields — making groups optional with &lt;code&gt;?&lt;/code&gt; where appropriate, but still assigning a penalty to templates that match fewer groups.&lt;/p&gt;
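
&lt;p&gt;Here is one way to express that, using a shortened variant of the MTN template with the fee clause made optional. The shortened template, the &lt;code&gt;parse_with_penalty&lt;/code&gt; name, and the scoring formula (share of groups that matched) are assumptions for this sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# The fee clause sits in an optional non-capturing group, so messages
# without a fee still match and the "fee" group is simply left unset.
TEMPLATE = re.compile(
    r"Payment made for GHS (?P&amp;lt;amount&amp;gt;[\d,]+\.?\d*) to (?P&amp;lt;counterparty_name&amp;gt;[A-Z ]+)\."
    r".*?Transaction ID: (?P&amp;lt;tx_id&amp;gt;\d+)\."
    r"(?: Fee charged: GHS (?P&amp;lt;fee&amp;gt;[\d,]+\.?\d*))?"
)

def parse_with_penalty(text):
    """Return (captures, score); score is the share of groups that matched."""
    match = TEMPLATE.search(text)
    if match is None:
        return None, 0.0
    captures = match.groupdict()  # unmatched optional groups come back as None
    matched = sum(value is not None for value in captures.values())
    return captures, matched / len(captures)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A message without the fee clause still parses, but its score drops from 1.0 to 0.75, which is the kind of penalty described above.&lt;/p&gt;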

&lt;h3&gt;
  
  
  Multi-line messages
&lt;/h3&gt;

&lt;p&gt;Some SMS arrive as multi-line strings (when the OS splits long messages). I strip newlines and normalize whitespace before matching.&lt;/p&gt;
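
&lt;p&gt;In Python that pre-processing step can be a one-liner. The &lt;code&gt;flatten_sms&lt;/code&gt; name is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def flatten_sms(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return " ".join(text.split())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
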




&lt;h2&gt;
  
  
  The confidence score
&lt;/h2&gt;

&lt;p&gt;Every &lt;code&gt;ParseResult&lt;/code&gt; has a &lt;code&gt;confidence&lt;/code&gt; field between 0 and 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1.0&lt;/code&gt; — template matched all expected groups; telco detected with high certainty&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0.8-0.99&lt;/code&gt; — partial match (some optional fields missing) or telco detected via keywords rather than sender ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0.5-0.79&lt;/code&gt; — fell back to generic template; fields extracted but telco uncertain&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0.0&lt;/code&gt; — no template matched; message is unrecognized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, I filter out results below &lt;code&gt;0.6&lt;/code&gt; confidence before passing to downstream enrichment. Unrecognized messages are logged for template improvement.&lt;/p&gt;
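
&lt;p&gt;A simplified sketch of that filtering step, assuming results arrive as &lt;code&gt;(raw_sms, confidence)&lt;/code&gt; pairs. The logger name and the &lt;code&gt;split_by_confidence&lt;/code&gt; helper are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

logger = logging.getLogger("momo.unrecognized")  # hypothetical logger name
MIN_CONFIDENCE = 0.6  # the threshold quoted above

def split_by_confidence(results, threshold=MIN_CONFIDENCE):
    """results: iterable of (raw_sms, confidence). Return (kept, dropped)."""
    kept, dropped = [], []
    for raw_sms, confidence in results:
        if confidence &amp;gt;= threshold:
            kept.append((raw_sms, confidence))
        else:
            dropped.append((raw_sms, confidence))
            if confidence == 0.0:
                # No template matched: keep the text for template improvement.
                logger.info("unrecognized SMS: %s", raw_sms)
    return kept, dropped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
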




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Regex is the right tool here.&lt;/strong&gt; I tried training a sequence model (token classification with a small BERT variant) to extract amounts and names. It was slower, harder to debug, and performed worse on rare transaction types with fewer training examples. Regex templates are fast, interpretable, and easy to extend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Named groups beat positional groups.&lt;/strong&gt; &lt;code&gt;(?P&amp;lt;amount&amp;gt;...)&lt;/code&gt; instead of &lt;code&gt;(...)&lt;/code&gt; makes the extraction code clean and the templates self-documenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Confidence scoring pays off.&lt;/strong&gt; Being explicit about uncertainty lets downstream systems make better decisions — a lending platform can request manual review for low-confidence parses rather than silently ingesting wrong amounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Template coverage is the real work.&lt;/strong&gt; The parser architecture took two days to build. Writing and testing templates for all 15+ formats took two weeks. Every new telco promotion, new SMS layout, or seasonal format change can break a template. I now run my test corpus of SMS messages through the parser on every commit (CI) and alert on any drop in coverage.&lt;/p&gt;
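
&lt;p&gt;A coverage check of this kind can be very small. In the sketch below, &lt;code&gt;parse&lt;/code&gt; is a stand-in for any callable that returns a confidence for a message, and the baseline value would come from wherever the previous run stored it. Both are assumptions, as are the function names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def coverage(corpus, parse):
    """Share of corpus messages for which parse() reports confidence above 0."""
    if not corpus:
        return 0.0
    recognized = sum(1 for sms in corpus if parse(sms) &amp;gt; 0.0)
    return recognized / len(corpus)

def check_coverage(corpus, parse, baseline):
    """Raise AssertionError when coverage falls below the stored baseline."""
    current = coverage(corpus, parse)
    assert current &amp;gt;= baseline, "coverage dropped: %.2f is below %.2f" % (current, baseline)
    return current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
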




&lt;h2&gt;
  
  
  Open source
&lt;/h2&gt;

&lt;p&gt;The parser is MIT-licensed and available at &lt;a href="https://github.com/theboylexis/momo-parse-api" rel="noopener noreferrer"&gt;github.com/theboylexis/momo-parse-api&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;momoparse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payment made for GHS 50.00 to DAVID BOATENG...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 50.0
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tx_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# "transfer_sent"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 0.97
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The categorization model, enrichment analytics, and financial profiling layer sit on top of the open-source parser and are available via the &lt;a href="https://web-production-5aa38.up.railway.app/docs" rel="noopener noreferrer"&gt;MomoParse API&lt;/a&gt; — free sandbox key included, no sign-up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built in Accra. Questions and PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>fintech</category>
      <category>api</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
