<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Paul Jacobs</title>
    <description>The latest articles on Forem by Paul Jacobs (@donpolanco).</description>
    <link>https://forem.com/donpolanco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809693%2Fd72bb97e-955d-4bd3-b651-e917fd1e6f7a.png</url>
      <title>Forem: Paul Jacobs</title>
      <link>https://forem.com/donpolanco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/donpolanco"/>
    <language>en</language>
    <item>
      <title>How I Built a Privacy-First, Automated Tech News Aggregator in PHP (And Why I Ditched the Heavy Frameworks)</title>
      <dc:creator>Paul Jacobs</dc:creator>
      <pubDate>Fri, 06 Mar 2026 10:44:42 +0000</pubDate>
      <link>https://forem.com/donpolanco/how-i-built-a-privacy-first-automated-tech-news-aggregator-in-php-and-why-i-ditched-the-heavy-3ipd</link>
      <guid>https://forem.com/donpolanco/how-i-built-a-privacy-first-automated-tech-news-aggregator-in-php-and-why-i-ditched-the-heavy-3ipd</guid>
      <description>&lt;p&gt;I was tired of tech news sites that track everything you click, load slowly, and hide half their content behind paywalls. So I built my own.&lt;/p&gt;

&lt;p&gt;The result is &lt;a href="https://pulsetech.news" rel="noopener noreferrer"&gt;PulseTech.news&lt;/a&gt; — a lightning-fast, automated tech news aggregator that updates every hour, covers 16 categories, and is fully GDPR/CCPA compliant with zero creepy tracking.&lt;/p&gt;

&lt;p&gt;Here's exactly how I built it, the architectural decisions I made, and what I learned along the way.&lt;/p&gt;




&lt;h2&gt;The Stack (And Why I Chose It)&lt;/h2&gt;

&lt;p&gt;Before I get into the architecture, here's the full stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PHP 8.x&lt;/strong&gt; — custom lightweight framework, no Laravel, no Symfony&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL 8.x&lt;/strong&gt; — via PDO with prepared statements throughout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind CSS&lt;/strong&gt; — standalone binary, no Node build pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SimplePie&lt;/strong&gt; — for RSS/Atom feed parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composer&lt;/strong&gt; — for dependency management (vlucas/phpdotenv, simplepie)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest decision here was &lt;strong&gt;rejecting heavy frameworks&lt;/strong&gt;. I didn't need the overhead of Laravel for what is essentially a read-heavy content site. A custom lightweight PHP framework gave me sub-100ms page loads and complete control over every byte hitting the wire.&lt;/p&gt;




&lt;h2&gt;Architecture: The Repository Pattern&lt;/h2&gt;

&lt;p&gt;The core of PulseTech.news is built around the &lt;strong&gt;Repository Pattern&lt;/strong&gt;. All data access is abstracted away from the page controllers and centralised in repository classes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Clean controller code — no SQL in sight&lt;/span&gt;
&lt;span class="nv"&gt;$articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$articleRepo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;getLatest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$filters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main repositories are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ArticleRepository&lt;/code&gt;&lt;/strong&gt; — handles all article retrieval, including language filtering (English by default, Spanish available)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FeedRepository&lt;/code&gt;&lt;/strong&gt; — manages feed sources and their language settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each repository receives a PDO instance via constructor injection, keeping the database logic contained and testable.&lt;/p&gt;
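
&lt;p&gt;As a rough illustration, a repository built this way might look like the sketch below. The table name, column names, and the &lt;code&gt;language&lt;/code&gt; filter key are assumptions made for the example, not the production schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical sketch: table and column names are assumptions, not the site's real schema.
final class ArticleRepository
{
    // The PDO handle is injected, so a test can pass in a throwaway database.
    public function __construct(private PDO $pdo) {}

    public function getLatest(int $limit, int $offset, array $filters = []): array
    {
        $language = $filters['language'] ?? 'en'; // English by default

        $stmt = $this-&amp;gt;pdo-&amp;gt;prepare(
            'SELECT id, title, url, published_at
               FROM articles
              WHERE language = :language
              ORDER BY published_at DESC
              LIMIT :limit OFFSET :offset'
        );
        // Bind LIMIT/OFFSET as integers so they are never sent as quoted strings.
        $stmt-&amp;gt;bindValue(':language', $language, PDO::PARAM_STR);
        $stmt-&amp;gt;bindValue(':limit', $limit, PDO::PARAM_INT);
        $stmt-&amp;gt;bindValue(':offset', $offset, PDO::PARAM_INT);
        $stmt-&amp;gt;execute();

        return $stmt-&amp;gt;fetchAll(PDO::FETCH_ASSOC);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the class only depends on the PDO interface, a unit test can hand it an in-memory database instead of the live MySQL server.&lt;/p&gt;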

&lt;p&gt;For database access itself, I used a &lt;strong&gt;Singleton pattern&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="nv"&gt;$pdo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Database&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One connection, one point of access, consistent throughout the application.&lt;/p&gt;
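
&lt;p&gt;A minimal sketch of what such a class could look like follows. The &lt;code&gt;DB_DSN&lt;/code&gt;, &lt;code&gt;DB_USER&lt;/code&gt; and &lt;code&gt;DB_PASS&lt;/code&gt; variable names are assumptions, and the in-memory SQLite fallback exists only so the example runs without a MySQL server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical sketch of a PDO singleton; environment variable names are assumptions.
final class Database
{
    private static ?Database $instance = null;
    private PDO $connection;

    // Private constructor: the only way to obtain an instance is getInstance().
    private function __construct()
    {
        $this-&amp;gt;connection = new PDO(
            getenv('DB_DSN') ?: 'sqlite::memory:', // fallback for the example only
            getenv('DB_USER') ?: null,
            getenv('DB_PASS') ?: null,
            [
                PDO::ATTR_ERRMODE =&amp;gt; PDO::ERRMODE_EXCEPTION,
                PDO::ATTR_DEFAULT_FETCH_MODE =&amp;gt; PDO::FETCH_ASSOC,
            ]
        );
    }

    // Creates the instance on first use and returns the same one afterwards.
    public static function getInstance(): self
    {
        return self::$instance ??= new self();
    }

    public function getConnection(): PDO
    {
        return $this-&amp;gt;connection;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The connection is opened lazily, so a request that never touches the database never pays for one.&lt;/p&gt;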




&lt;h2&gt;The Scraper Engine&lt;/h2&gt;

&lt;p&gt;This was the most interesting part to build. The scraper (&lt;code&gt;classes/Scraper.php&lt;/code&gt;) runs as a headless background process on an hourly cron cycle. Here's what it does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Feed Management&lt;/strong&gt;&lt;br&gt;
RSS and Atom feeds from the world's top tech sources are managed via an admin panel. Adding a new source is a one-click operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Intelligent Categorisation&lt;/strong&gt;&lt;br&gt;
Rather than relying on the source's own tags (which are inconsistent), I built a weighted keyword detection system. Each article's title and description are scored against keyword sets for each category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI&lt;/li&gt;
&lt;li&gt;Cybersecurity&lt;/li&gt;
&lt;li&gt;Apple / iOS / iPadOS / iPhone / Mac&lt;/li&gt;
&lt;li&gt;Android / Samsung&lt;/li&gt;
&lt;li&gt;Linux&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Gaming&lt;/li&gt;
&lt;li&gt;Robots&lt;/li&gt;
&lt;li&gt;Google / Tesla&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The weighting system ensures "AI" news stays in AI, "Cybersecurity" stays in security, and articles don't bleed into the wrong categories. This took the most iteration to get right.&lt;/p&gt;
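
&lt;p&gt;To make the idea concrete, here is one way weighted keyword scoring could be sketched. The keyword sets, the weights, and the title boost are all invented for the example; the real values are not published in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical keyword weights, chosen only to illustrate the scoring.
const CATEGORY_KEYWORDS = [
    'AI'            =&amp;gt; ['openai' =&amp;gt; 3, 'llm' =&amp;gt; 3, 'machine learning' =&amp;gt; 2, 'model' =&amp;gt; 1],
    'Cybersecurity' =&amp;gt; ['ransomware' =&amp;gt; 3, 'vulnerability' =&amp;gt; 3, 'breach' =&amp;gt; 2, 'patch' =&amp;gt; 1],
    'Linux'         =&amp;gt; ['linux' =&amp;gt; 3, 'kernel' =&amp;gt; 2, 'ubuntu' =&amp;gt; 2],
];

// Returns the highest-scoring category, or null when no keyword matched.
function categorise(string $title, string $description, int $titleBoost = 2): ?string
{
    $best = null;
    $bestScore = 0;

    foreach (CATEGORY_KEYWORDS as $category =&amp;gt; $keywords) {
        $score = 0;
        foreach ($keywords as $keyword =&amp;gt; $weight) {
            // Word-boundary, case-insensitive match so "patch" does not fire on "dispatcher".
            $pattern = '/\b' . preg_quote($keyword, '/') . '\b/iu';
            // A hit in the title counts for more than a hit in the description.
            $score += $weight * $titleBoost * preg_match_all($pattern, $title);
            $score += $weight * preg_match_all($pattern, $description);
        }
        if ($score &amp;gt; $bestScore) {
            $best = $category;
            $bestScore = $score;
        }
    }

    return $best;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tuning then becomes a matter of adjusting weights in one table rather than rewriting matching logic, which helps when a category keeps attracting the wrong stories.&lt;/p&gt;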

&lt;p&gt;&lt;strong&gt;3. Deduplication&lt;/strong&gt;&lt;br&gt;
Articles are deduplicated on source URL before insertion, so the same article is never stored twice, even when it appears in more than one feed.&lt;/p&gt;
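
&lt;p&gt;URL-based deduplication works best when trivially different links to the same article compare equal. The helper below is a sketch of one way to do that, not the scraper's actual code: the function name and the list of tracking parameters are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical helper: reduce a URL to a stable key for duplicate checks.
function dedupeKey(string $url): string
{
    $parts = parse_url(trim($url)) ?: [];

    // Ignore scheme, case, a leading "www." and a trailing slash.
    $host = preg_replace('/^www\./', '', strtolower($parts['host'] ?? ''));
    $path = rtrim($parts['path'] ?? '', '/');

    // Drop tracking parameters (assumed list) and sort what remains.
    parse_str($parts['query'] ?? '', $query);
    foreach (array_keys($query) as $name) {
        $name = (string) $name;
        if (str_starts_with($name, 'utm_') || $name === 'fbclid') {
            unset($query[$name]);
        }
    }
    ksort($query);
    $queryString = $query ? '?' . http_build_query($query) : '';

    return sha1($host . $path . $queryString);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Storing this key in a column with a unique index would let the database reject duplicates on insert, instead of relying on a lookup beforehand.&lt;/p&gt;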


&lt;h2&gt;The Shift from Vibe Coding to Agentic Engineering&lt;/h2&gt;

&lt;p&gt;I want to be honest about the build process here, because I think it matters.&lt;/p&gt;

&lt;p&gt;The first version of PulseTech.news was largely &lt;strong&gt;vibe coded&lt;/strong&gt; — prompting AI for code, tweaking until it worked, posting screenshots. The UI looked great. But the underlying system was fragile.&lt;/p&gt;

&lt;p&gt;The real work came when I shifted to &lt;strong&gt;agentic engineering&lt;/strong&gt;: designing structured workflows, context documents, validation loops, and a full architecture overview (&lt;code&gt;PROJECT_ARCHITECTURE.md&lt;/code&gt;) that AI agents could operate within without breaking the build.&lt;/p&gt;

&lt;p&gt;The difference was enormous. Instead of getting code that &lt;em&gt;looked&lt;/em&gt; right, I got code that &lt;em&gt;behaved&lt;/em&gt; correctly within the system. The Repository Pattern, the static helpers, the testing standards — all of it was designed so that an AI agent could contribute to the codebase following the same rules as a human developer.&lt;/p&gt;


&lt;h2&gt;Static Helper Classes&lt;/h2&gt;

&lt;p&gt;Rather than polluting controllers with raw superglobal access, I built a set of static helper classes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="c1"&gt;// Clean session access&lt;/span&gt;
&lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'page'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;           &lt;span class="c1"&gt;// Sanitised GET/POST input&lt;/span&gt;
&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'DB_HOST'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;       &lt;span class="c1"&gt;// Environment variable access&lt;/span&gt;
&lt;span class="nc"&gt;UIHelper&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;ArticleCard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Reusable UI components&lt;/span&gt;
&lt;span class="nc"&gt;Theme&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;isDark&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;              &lt;span class="c1"&gt;// Dark/light mode state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These keep the controllers clean and make the codebase easy for AI agents (and human developers) to navigate consistently.&lt;/p&gt;
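
&lt;p&gt;For a sense of what sits behind one of these helpers, here is a minimal sketch of an &lt;code&gt;Input&lt;/code&gt; class. The sanitisation rules and the &lt;code&gt;GetInt&lt;/code&gt; method are assumptions for illustration; the real implementation is not shown in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical sketch of an Input helper.
final class Input
{
    // Returns a cleaned string from GET or POST, or $default when absent.
    public static function Get(string $key, ?string $default = null): ?string
    {
        $value = $_GET[$key] ?? $_POST[$key] ?? null;
        if (!is_string($value)) {
            return $default; // missing, or an array such as ?page[]=1
        }
        // Strip control characters (including newlines); escape for HTML at output time.
        return trim(preg_replace('/[\x00-\x1F\x7F]/', '', $value));
    }

    // Convenience wrapper for numeric parameters such as page numbers.
    public static function GetInt(string $key, int $default, int $min = 1): int
    {
        $value = filter_var(self::Get($key), FILTER_VALIDATE_INT);
        return $value === false ? $default : max($min, $value);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A controller can then call &lt;code&gt;Input::GetInt('page', 1)&lt;/code&gt; and trust that it has a positive integer, whatever the query string contained.&lt;/p&gt;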




&lt;h2&gt;SEO &amp;amp; Structured Data&lt;/h2&gt;

&lt;p&gt;Every listing on PulseTech.news is backed by &lt;strong&gt;JSON-LD Schema.org structured data&lt;/strong&gt;, which gives search engines machine-readable context for each page. The header system manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page-specific Open Graph and Twitter Card meta tags&lt;/li&gt;
&lt;li&gt;Canonical URLs (auto-calculated pretty URLs)&lt;/li&gt;
&lt;li&gt;JSON-LD &lt;code&gt;Organization&lt;/code&gt; and &lt;code&gt;WebSite&lt;/code&gt; schemas&lt;/li&gt;
&lt;li&gt;Dynamic &lt;code&gt;$pageTitle&lt;/code&gt;, &lt;code&gt;$pageDescription&lt;/code&gt;, and &lt;code&gt;$ogImage&lt;/code&gt; variables per page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was a deliberate investment in long-term organic traffic. SEO is compounding — the work you do today pays off for months.&lt;/p&gt;
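
&lt;p&gt;Generating the JSON-LD itself takes very little code. The sketch below builds a &lt;code&gt;WebSite&lt;/code&gt; schema; the function name is hypothetical and the property set is the bare minimum rather than what the site actually emits.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical sketch: renders a Schema.org WebSite object as a JSON-LD script tag.
function jsonLdWebSite(string $name, string $url): string
{
    $schema = [
        '@context' =&amp;gt; 'https://schema.org',
        '@type'    =&amp;gt; 'WebSite',
        'name'     =&amp;gt; $name,
        'url'      =&amp;gt; $url,
    ];

    // JSON_HEX_TAG escapes "&amp;lt;" and "&amp;gt;" so a value cannot close the script tag early.
    $json = json_encode(
        $schema,
        JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_HEX_TAG | JSON_THROW_ON_ERROR
    );

    return '&amp;lt;script type="application/ld+json"&amp;gt;' . $json . '&amp;lt;/script&amp;gt;';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;jsonLdWebSite('PulseTech.news', 'https://pulsetech.news')&lt;/code&gt; in the page header would emit the tag. Building the array in PHP and encoding it once avoids hand-written JSON in templates.&lt;/p&gt;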




&lt;h2&gt;Privacy First&lt;/h2&gt;

&lt;p&gt;PulseTech.news implements &lt;strong&gt;Google Consent Mode v2&lt;/strong&gt; and &lt;strong&gt;PII-free click tracking&lt;/strong&gt;. Here's what that means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No personal data is stored on click events&lt;/li&gt;
&lt;li&gt;Full GDPR/CCPA compliance without sacrificing analytics&lt;/li&gt;
&lt;li&gt;Consent banner with genuine reject option (not a dark pattern)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This wasn't just an ethical choice: it's increasingly a legal requirement, and a genuine differentiator as users become more privacy-conscious.&lt;/p&gt;
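
&lt;p&gt;What a PII-free click record can look like is easiest to show in code. The event shape below is an assumption for illustration, including the choice to round timestamps to the hour; it is not the site's actual tracking code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical sketch: a click event that carries no personal data.
function buildClickEvent(int $articleId, ?DateTimeImmutable $now = null): array
{
    $now ??= new DateTimeImmutable('now', new DateTimeZone('UTC'));

    return [
        'article_id'   =&amp;gt; $articleId,
        // Rounded down to the hour so rows are harder to correlate per visitor.
        'clicked_hour' =&amp;gt; $now-&amp;gt;format('Y-m-d H:00:00'),
        // Deliberately absent: IP address, user agent, cookies, session or user IDs.
    ];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Counting rows per article still shows which stories are popular, which is all the analytics a news listing really needs.&lt;/p&gt;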




&lt;h2&gt;Security Standards&lt;/h2&gt;

&lt;p&gt;Every database interaction uses &lt;strong&gt;PDO prepared statements&lt;/strong&gt;. No exceptions. All POST forms include CSRF tokens, and admin routes are protected via session-based authorisation checks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Always prepared statements — never raw interpolation&lt;/span&gt;
&lt;span class="nv"&gt;$stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$pdo&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"SELECT * FROM articles WHERE id = :id"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nv"&gt;$stmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;':id'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
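
&lt;p&gt;The CSRF side can be sketched just as briefly. The function names and the session key below are hypothetical, and the example assumes &lt;code&gt;session_start()&lt;/code&gt; has already been called.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;// Hypothetical helpers; assumes the session is already started.
function csrfToken(): string
{
    // One token per session, generated from a cryptographically secure source.
    return $_SESSION['csrf_token'] ??= bin2hex(random_bytes(32));
}

function csrfIsValid(?string $submitted): bool
{
    $expected = $_SESSION['csrf_token'] ?? '';
    if ($expected === '' || $submitted === null) {
        return false; // no token issued yet, or none submitted
    }
    // hash_equals compares in constant time to avoid timing side channels.
    return hash_equals($expected, $submitted);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each form embeds &lt;code&gt;csrfToken()&lt;/code&gt; in a hidden field, and the POST handler rejects the request unless &lt;code&gt;csrfIsValid()&lt;/code&gt; returns true for the submitted value.&lt;/p&gt;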






&lt;h2&gt;Testing&lt;/h2&gt;

&lt;p&gt;The project uses &lt;strong&gt;PHPUnit&lt;/strong&gt; for automated testing, with the tests kept in &lt;code&gt;tests/&lt;/code&gt;. Every Repository and Business Logic class has a corresponding test file. The naming convention is strict: &lt;code&gt;ClassNameTest.php&lt;/code&gt;, and the suite is bootstrapped via &lt;code&gt;tests/bootstrap.php&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./vendor/bin/phpunit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having a test suite was essential when using AI agents to contribute code — it gave me a fast feedback loop to catch regressions before they hit production.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;PulseTech.news is live and updating hourly. Here's what's on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User accounts&lt;/strong&gt; with a personal 'Read Later' library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-driven personalised feeds&lt;/strong&gt; — only see the categories and sources you care about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More sources and languages&lt;/strong&gt; — currently English and Spanish, expanding soon&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;👉 &lt;a href="https://pulsetech.news" rel="noopener noreferrer"&gt;pulsetech.news&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's completely free. No paywalls. No bloat. 16 categories updated every hour.&lt;/p&gt;

&lt;p&gt;I'd love feedback on the speed, the dark mode UI, and any tech sources you think I should add to the scraper. Drop them in the comments below.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>php</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
