<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sylvain Artois</title>
    <description>The latest articles on Forem by Sylvain Artois (@sartois).</description>
    <link>https://forem.com/sartois</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847939%2F75685089-163d-4e70-8a9d-27bab436dba3.jpg</url>
      <title>Forem: Sylvain Artois</title>
      <link>https://forem.com/sartois</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sartois"/>
    <language>en</language>
    <item>
      <title>Full-Text Search on a 2 GB PostgreSQL Instance</title>
      <dc:creator>Sylvain Artois</dc:creator>
      <pubDate>Tue, 12 May 2026 08:46:55 +0000</pubDate>
      <link>https://forem.com/sartois/full-text-search-on-a-2-gb-postgresql-instance-337j</link>
      <guid>https://forem.com/sartois/full-text-search-on-a-2-gb-postgresql-instance-337j</guid>
      <description>&lt;p&gt;&lt;em&gt;AFK.live aggregates headlines from 250+ news sources. I needed search. I had a 2 GB database and no budget for Elasticsearch.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fair warning&lt;/strong&gt;: this article goes deep into PostgreSQL internals — query plans, GIN indexes, trigram scoring. I've added links throughout for context. If you're not familiar with these concepts, the links are there for you. If you are, skip them and enjoy the war stories.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Constraint
&lt;/h2&gt;

&lt;p&gt;AFK runs on a Scaleway managed PostgreSQL instance — the smallest tier they offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DB-PLAY2-PICO&lt;/strong&gt;: 1 vCPU, 2 GB RAM&lt;/li&gt;
&lt;li&gt;Cost: roughly 70% of the entire AFK hosting budget (50€)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No PGVector. No Elasticsearch. No Typesense. Just PostgreSQL 16 with the &lt;code&gt;pg_trgm&lt;/code&gt; and &lt;code&gt;unaccen&lt;/code&gt; extensions, which Scaleway includes by default on managed instances.&lt;/p&gt;

&lt;p&gt;The headlines table has grown past 1 million rows. I needed fuzzy search that handles French accents, returns results in under 300ms, and doesn't bring the tiny instance to its knees.&lt;/p&gt;

&lt;p&gt;Here's how I got there — and the bugs I hit along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Foundation: A Materialized View
&lt;/h2&gt;

&lt;p&gt;Searching directly against the &lt;code&gt;headlines&lt;/code&gt; table would mean joining with &lt;code&gt;sources&lt;/code&gt; on every query, and computing &lt;code&gt;lower(unaccent(title))&lt;/code&gt; at query time for every candidate row. On a 1 vCPU machine, that's not going to work.&lt;/p&gt;

&lt;p&gt;The solution is a &lt;a href="https://www.postgresql.org/docs/current/rules-materializedviews.html" rel="noopener noreferrer"&gt;materialized view&lt;/a&gt; that pre-computes the normalized search field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;search_title&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;headlines&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_from_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;search_title&lt;/code&gt; is pre-lowercased and unaccented at materialization time, not at query time. The view also filters out sources I've flagged as excluded (far-right sources).&lt;/p&gt;

&lt;p&gt;On top of this, three indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- GIN trigram index for fuzzy text matching&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_headlines_search_title_gin&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt; &lt;span class="n"&gt;gin_trgm_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- For chronological ordering and date range filters&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_headlines_search_published_at&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- For source filtering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_headlines_search_source_slug&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_slug&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://pganalyze.com/blog/gin-index" rel="noopener noreferrer"&gt;GIN trigram index&lt;/a&gt; is the core of the search: it breaks every &lt;code&gt;search_title&lt;/code&gt; into 3-character grams and indexes them, enabling &lt;a href="https://www.postgresql.org/docs/current/pgtrgm.html" rel="noopener noreferrer"&gt;fuzzy matching&lt;/a&gt; without full table scans.&lt;/p&gt;

&lt;h3&gt;
  
  
  Refreshing the View
&lt;/h3&gt;

&lt;p&gt;A materialized view is a snapshot. New headlines are invisible until you refresh it. AFK's CI/CD pipeline deploys every two hours, so I added the refresh as the first step of the deployment script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# deploy.legacy.sh — runs before Astro build&lt;/span&gt;
&lt;span class="nb"&gt;env &lt;/span&gt;&lt;span class="nv"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SCALEWAY_DB_PASSWORD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; psql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SCALEWAY_DB_HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SCALEWAY_DB_PORT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SCALEWAY_DB_USER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SCALEWAY_DB_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"REFRESH MATERIALIZED VIEW headlines_search;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"headlines_search refreshed successfully"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"headlines_search refresh failed, continuing with stale data"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note on &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt;: it exists, and I have the unique index it requires. But over the remote connection to Scaleway's managed instance, the concurrent refresh kept hitting SSL EOF errors on longer operations. The non-concurrent variant works reliably — but it locks the view for the duration of the refresh. During that lock, any search query hitting the view will fail. We deploy during business hours, so this is a real trade-off: a few seconds of downtime every two hours. On a site with a few concurrent users, that's acceptable. At larger scale, you'd want to schedule refreshes overnight and accept 24 hours of staleness — or fix the SSL issue and use &lt;code&gt;CONCURRENTLY&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #1: The View That Was Never Refreshed
&lt;/h2&gt;

&lt;p&gt;The search page went live. It worked. Then a user pointed out that no results appeared after November 2025.&lt;/p&gt;

&lt;p&gt;The materialized view had &lt;strong&gt;never been refreshed&lt;/strong&gt; since its creation. It had 357K rows. The &lt;code&gt;headlines&lt;/code&gt; table had over 1 million. Two-thirds of the articles were invisible to search.&lt;/p&gt;

&lt;p&gt;The frustrating part: I had planned for this. The git history tells the story. The initial commit (October 2025) didn't just create the view — it also created a &lt;code&gt;refresh_headlines_search()&lt;/code&gt; PL/pgSQL function and configured two &lt;a href="https://github.com/mcuadros/ofelia" rel="noopener noreferrer"&gt;Ofelia&lt;/a&gt; scheduler jobs in &lt;code&gt;scheduler-config.ini&lt;/code&gt;: one for weekdays (refreshing nine times a day, from 6:15 to 22:15 CET) and one for Sundays (five times a day). The infrastructure was designed correctly.&lt;/p&gt;

&lt;p&gt;But the scheduler jobs referenced environment variables (&lt;code&gt;${SCALEWAY_DB_ADMIN_PASSWORD}&lt;/code&gt;) that were never injected into the container. The jobs ran on schedule, failed silently, and nobody noticed for five months. The &lt;code&gt;refresh_headlines_search()&lt;/code&gt; function I carefully wrote in the migration? Never called once.&lt;/p&gt;

&lt;p&gt;Twenty years of engineering experience, and I forgot to check that my cron jobs actually worked. The fix was the deploy script change above — crude but reliable, because it runs in the same environment that already has the database credentials.&lt;/p&gt;

&lt;p&gt;The lesson: materialized views are a pit of success when they work, and a silent disaster when they go stale. There's no built-in warning. If you use them, &lt;strong&gt;monitor the row count&lt;/strong&gt; or the most recent &lt;code&gt;published_at&lt;/code&gt; — don't assume your refresh mechanism works just because it exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug #2: The 4.4-Second Query
&lt;/h2&gt;

&lt;p&gt;After the refresh brought the view to ~966K rows, the next problem appeared immediately. Searching "donald trump" took &lt;strong&gt;4.4 seconds&lt;/strong&gt; and triggered an HTTP 504 timeout. (The short timeout is intentional — I'd rather show an error than let a slow query go unnoticed. If search is too slow, I want to know about it before users start complaining.)&lt;/p&gt;

&lt;p&gt;Here's what the original query looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/pgtrgm.html#PGTRGM-OP-TABLE" rel="noopener noreferrer"&gt;&lt;code&gt;%&lt;/code&gt; operator&lt;/a&gt; is pg_trgm's similarity operator. It uses the GIN index. Sounds efficient. Let's look at &lt;a href="https://www.postgresql.org/docs/current/using-explain.html" rel="noopener noreferrer"&gt;EXPLAIN ANALYZE&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;idx_headlines_search_title_gin&lt;/span&gt;
  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'donald trump'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
  &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;55508&lt;/span&gt;

&lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;
  &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Recheck&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;55351&lt;/span&gt;
  &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32113&lt;/span&gt;
  &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4443&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The GIN index returned &lt;strong&gt;55,508 candidates&lt;/strong&gt;. After the recheck filter, only 157 survived — 99.7% noise. PostgreSQL had to load 32,113 heap blocks (a &lt;a href="https://www.postgresql.org/docs/current/storage-page-layout.html" rel="noopener noreferrer"&gt;heap block&lt;/a&gt; is an 8 KB page of actual table data — 32K blocks means ~250 MB of disk reads) to evaluate the similarity predicate on each candidate. On a 2 GB instance, that's a significant chunk of available memory spent on I/O.&lt;/p&gt;

&lt;p&gt;The problem is how GIN trigram indexes work: they find all rows that share &lt;em&gt;any&lt;/em&gt; trigrams with the query. For a common term like "donald trump", that's a lot of rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix: &lt;code&gt;strict_word_similarity&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;PostgreSQL's &lt;a href="https://www.postgresql.org/docs/current/pgtrgm.html" rel="noopener noreferrer"&gt;pg_trgm extension&lt;/a&gt; offers three similarity functions, and they behave very differently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;similarity(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How similar the full strings are&lt;/td&gt;
&lt;td&gt;&lt;code&gt;%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;word_similarity(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether &lt;code&gt;a&lt;/code&gt; matches any word in &lt;code&gt;b&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strict_word_similarity(a, b)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether &lt;code&gt;a&lt;/code&gt; matches a contiguous word sequence in &lt;code&gt;b&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;&amp;lt;%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;similarity&lt;/code&gt; compares the entire query against the entire title — terrible for short queries against long titles. &lt;code&gt;word_similarity&lt;/code&gt; is more lenient but still too broad. &lt;code&gt;strict_word_similarity&lt;/code&gt; requires the query to match a &lt;em&gt;contiguous sequence of words&lt;/em&gt; in the title. That's exactly what you want when someone searches for a proper noun like "donald trump" or "réchauffement climatique".&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;&amp;lt;%&lt;/code&gt; operator also changed the query plan. Instead of the GIN bitmap scan, PostgreSQL used an incremental sort on the &lt;code&gt;published_at&lt;/code&gt; index — scanning from most recent to oldest and stopping once it had enough results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;strict_word_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;headlines_search&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;%&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Execution time&lt;/th&gt;
&lt;th&gt;Index used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Original (&lt;code&gt;%&lt;/code&gt; + similarity)&lt;/td&gt;
&lt;td&gt;4,444ms&lt;/td&gt;
&lt;td&gt;GIN (55K candidates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raised threshold to 0.3&lt;/td&gt;
&lt;td&gt;836ms&lt;/td&gt;
&lt;td&gt;GIN (55K candidates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;word_similarity&lt;/code&gt; &lt;code&gt;&amp;lt;%&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;765ms&lt;/td&gt;
&lt;td&gt;published_at&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;strict_word_similarity&lt;/code&gt; &lt;code&gt;&amp;lt;&amp;lt;%&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;221ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;published_at&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From 4.4 seconds to 221ms. The 2 GB instance can handle that.&lt;/p&gt;

&lt;p&gt;The trade-off is real, though: &lt;code&gt;strict_word_similarity&lt;/code&gt; matches contiguous word sequences, not approximate ones. It won't forgive typos the way &lt;code&gt;similarity&lt;/code&gt; does — "donlad trump" returns nothing. You're trading fuzzy tolerance for speed and precision. For a news search where users type proper nouns they already know how to spell, that's the right call. For a general-purpose search box, it might not be.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Relevance Problem
&lt;/h2&gt;

&lt;p&gt;Fast isn't useful if the results are wrong. A user reported that searching "emmanuel macron" returned headlines about Emmanuel Bellanger, Emmanuel Grégoire, and Emmanuel-Philibert de Savoie mixed in with actual Macron results.&lt;/p&gt;

&lt;p&gt;I tested with synthetic data to understand the scoring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;strict_word_similarity&lt;/code&gt; score&lt;/th&gt;
&lt;th&gt;Matches?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"emmanuel macron annonce une réforme"&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"emmanuel bellanger critique le gouvernement"&lt;/td&gt;
&lt;td&gt;0.5625&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"emmanuel grégoire quitte la mairie de paris"&lt;/td&gt;
&lt;td&gt;0.5625&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"macron reçoit biden à l'élysée"&lt;/td&gt;
&lt;td&gt;0.4375&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The issue: "emmanuel" alone is 8 characters out of the 16-character query. That's enough trigram overlap to produce a score of 0.5625 — well above the default &lt;code&gt;pg_trgm.strict_word_similarity_threshold&lt;/code&gt; of 0.3. Any "Emmanuel X" headline passes the filter.&lt;/p&gt;

&lt;p&gt;On the real dataset, the score distribution confirmed it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What they are&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;4,246&lt;/td&gt;
&lt;td&gt;Actual "Emmanuel Macron" headlines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;977&lt;/td&gt;
&lt;td&gt;"Emmanuel [someone else]" noise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;19% noise. And the problem is specific to queries containing common first names — "réchauffement climatique" worked perfectly because both words are long and distinctive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Threshold Tuning: SET vs WHERE
&lt;/h3&gt;

&lt;p&gt;Two ways to raise the bar:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A&lt;/strong&gt; — &lt;a href="https://www.postgresql.org/docs/current/sql-set.html" rel="noopener noreferrer"&gt;&lt;code&gt;SET&lt;/code&gt;&lt;/a&gt; the threshold before the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;pg_trgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strict_word_similarity_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;%&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B&lt;/strong&gt; — add a &lt;code&gt;WHERE&lt;/code&gt; clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unaccent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;%&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_title&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;strict_word_similarity&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They produce the same results but perform very differently. The &lt;code&gt;SET&lt;/code&gt; approach tells the GIN index to use 0.8 as its scan threshold — it returns 4,295 candidates. The &lt;code&gt;WHERE&lt;/code&gt; approach uses the default 0.3 for the index scan (8,460 candidates), then discards rows in the heap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SET approach: 166ms, 4,295 index candidates&lt;/span&gt;
&lt;span class="c1"&gt;-- WHERE approach: 348ms, 8,460 index candidates&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SET&lt;/code&gt; is 2x faster because it pushes the selectivity into the index. The catch is that &lt;code&gt;SET&lt;/code&gt; is a session-level command — it changes the threshold for the entire database connection, not just one query. If you're using a connection pool (and you should be), that modified connection goes back into the pool after your request. The next request that picks it up inherits the 0.8 threshold, which might not be what it expects.&lt;/p&gt;

&lt;p&gt;How we handle this is covered in the next section.&lt;/p&gt;




&lt;h2&gt;
  
  
  The API Endpoint
&lt;/h2&gt;

&lt;p&gt;Everything comes together in a single Astro SSR endpoint at &lt;code&gt;/api/search/headlines&lt;/code&gt;. Here's the core flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get a dedicated client from the pool&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;databaseService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Set threshold only when there's a text query&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;databaseService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeQueryWithClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SET pg_trgm.strict_word_similarity_threshold = 0.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Execute search and count queries on the same client&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;databaseService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeQueryWithClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;countResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;databaseService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeQueryWithClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;countQuery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;countParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;totalCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;countResult&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SET&lt;/code&gt; command only fires when there's a text query. If the user is just browsing by source or date — no text, no threshold change.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;finally&lt;/code&gt; block is where we handle the session-level &lt;code&gt;SET&lt;/code&gt; problem. When &lt;code&gt;client.release()&lt;/code&gt; returns the connection to the pool, it carries the 0.8 threshold with it. In our case, this is fine: the search endpoint is the only consumer of &lt;code&gt;strict_word_similarity&lt;/code&gt;, so every text search sets the threshold before querying. No other code path depends on the default 0.3. If it did, we'd need to &lt;code&gt;SET pg_trgm.strict_word_similarity_threshold = DEFAULT&lt;/code&gt; before releasing — or use &lt;code&gt;SET LOCAL&lt;/code&gt; inside a transaction, which scopes the setting to that transaction only.&lt;/p&gt;

&lt;p&gt;The query is built dynamically based on which filters are present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Text search: strict_word_similarity with &amp;lt;&amp;lt;% operator&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`lower(unaccent($&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;paramCounter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)) &amp;lt;&amp;lt;% hs.search_title`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Optional source, date_from, date_to filters&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;source_slug&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`hs.source_slug = $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;paramCounter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scoring column adapts too: &lt;code&gt;strict_word_similarity(...)&lt;/code&gt; when there's a text query, &lt;code&gt;0&lt;/code&gt; otherwise. Results are ordered by &lt;code&gt;published_at DESC, score DESC&lt;/code&gt; — most recent first, highest relevance as tiebreaker.&lt;/p&gt;

&lt;p&gt;Pagination uses &lt;code&gt;LIMIT&lt;/code&gt;/&lt;code&gt;OFFSET&lt;/code&gt; with a separate &lt;code&gt;COUNT(*)&lt;/code&gt; query for the total. Responses get a 5-minute &lt;code&gt;Cache-Control&lt;/code&gt; header since the view only changes at deploy time anyway.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Frontend
&lt;/h2&gt;

&lt;p&gt;The search page is an Astro SSR page with an &lt;a href="https://alpinejs.dev/" rel="noopener noreferrer"&gt;Alpine.js&lt;/a&gt; component handling all client-side interaction. No framework overhead — Alpine weighs ~15 KB and feels right for a form with a few filters.&lt;/p&gt;

&lt;p&gt;The component manages four concerns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source autocomplete&lt;/strong&gt;: Sources are loaded from an Astro content collection into a JSON data island at build time. The autocomplete filters client-side — no API call needed. Selecting a source shows it as a chip with the outlet's favicon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Date range filters&lt;/strong&gt;: Two date inputs that map directly to &lt;code&gt;date_from&lt;/code&gt; and &lt;code&gt;date_to&lt;/code&gt; query parameters. Simple, but together with the source filter they allow powerful browsing: "show me everything from Le Monde in March 2026".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL state sync&lt;/strong&gt;: Every search updates the browser URL via &lt;code&gt;history.replaceState()&lt;/code&gt; — &lt;code&gt;?q=macron&amp;amp;source_slug=lemonde&amp;amp;page=2&lt;/code&gt;. Bookmarkable, shareable. On page load, &lt;code&gt;init()&lt;/code&gt; reads the URL params back and re-executes the search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pagination&lt;/strong&gt;: Page state is managed in Alpine and synced to the URL. The API returns &lt;code&gt;total_count&lt;/code&gt;, Alpine computes &lt;code&gt;totalPages&lt;/code&gt;, and the UI shows Previous/Next controls. Page changes trigger a new API call and smooth-scroll back to the top.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;executeSearch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;q&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;offset&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selectedSource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;source_slug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selectedSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dateFrom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;date_from&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dateFrom&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dateTo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;date_to&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dateTo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Update browser URL without reload&lt;/span&gt;
  &lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replaceState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`?&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;browserParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/search/headlines/?&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No debounce, no search-as-you-type. The user submits the form, sees results. Fast enough that it feels instant.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a 2 GB Database Can Do
&lt;/h2&gt;

&lt;p&gt;The final stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Materialized view&lt;/strong&gt; with pre-computed &lt;code&gt;lower(unaccent(title))&lt;/code&gt; — avoids runtime text normalization on 966K rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GIN trigram index&lt;/strong&gt; — handles fuzzy matching with French accents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;strict_word_similarity&lt;/code&gt;&lt;/strong&gt; with &lt;code&gt;&amp;lt;&amp;lt;%&lt;/code&gt; operator — 20x faster than &lt;code&gt;similarity&lt;/code&gt; for multi-word queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session-level threshold at 0.8&lt;/strong&gt; — pushes selectivity into the index, eliminates 19% noise on name queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Astro SSR + Alpine.js&lt;/strong&gt; — server-side API, lightweight client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-minute cache&lt;/strong&gt; — the view only changes at deploy time anyway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Average query time: &lt;strong&gt;~200ms&lt;/strong&gt; for text search, faster for filter-only browsing. On the smallest managed PostgreSQL instance Scaleway offers.&lt;/p&gt;

&lt;p&gt;The lesson is boring but worth repeating: before reaching for a search service, check what your existing database can do. PostgreSQL's &lt;code&gt;pg_trgm&lt;/code&gt; isn't Elasticsearch — there's no stemming, no typo tolerance beyond trigram overlap, no field boosting. But for searching a million news headlines in a single language, it's more than enough.&lt;/p&gt;

&lt;p&gt;The real optimization work wasn't about choosing the right tool. It was about reading EXPLAIN ANALYZE output, understanding how GIN indexes actually scan, and tuning a single threshold value.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>astro</category>
    </item>
    <item>
      <title>BERTopic Hyperparameter Tuning: Building a Speed-Optimized Grid Search System</title>
      <dc:creator>Sylvain Artois</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:28:27 +0000</pubDate>
      <link>https://forem.com/sartois/bertopic-hyperparameter-tuning-building-a-speed-optimized-grid-search-system-4g3l</link>
      <guid>https://forem.com/sartois/bertopic-hyperparameter-tuning-building-a-speed-optimized-grid-search-system-4g3l</guid>
      <description>&lt;p&gt;&lt;strong&gt;NLP clustering&lt;/strong&gt; isn't exactly intuitive for a senior developer dipping their toes into &lt;strong&gt;machine learning&lt;/strong&gt; waters. Unlike debugging a server issue or fixing a constantly re-rendering React component, I have zero intuition about what parameter values might work. And honestly? The literature doesn't always help.&lt;/p&gt;

&lt;p&gt;So I built myself a hyperparameter tuner for &lt;a href="https://maartengr.github.io/BERTopic/" rel="noopener noreferrer"&gt;BERTopic&lt;/a&gt;, designed to test the maximum number of parameter combinations in the shortest possible time. It's essentially a race against the clock: how can we make experimentation fast enough to explore a vast parameter space?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gitlab.com/sartois/bertopic-hyperparameter-tuning" rel="noopener noreferrer"&gt;The source code is available on GitLab&lt;/a&gt; for anyone who wants to experiment. The project is packaged as a proper Python package — install it with &lt;code&gt;pip install -e .&lt;/code&gt; and use the &lt;code&gt;bertopic-tune&lt;/code&gt; and &lt;code&gt;bertopic-model&lt;/code&gt; CLI commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: From Manual Intuition to Systematic Search
&lt;/h2&gt;

&lt;p&gt;When you're working with &lt;strong&gt;BERTopic&lt;/strong&gt;, you're actually dealing with multiple algorithms working in concert:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://umap-learn.readthedocs.io/" rel="noopener noreferrer"&gt;UMAP&lt;/a&gt;&lt;/strong&gt; for dimensionality reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://hdbscan.readthedocs.io/" rel="noopener noreferrer"&gt;HDBSCAN&lt;/a&gt;&lt;/strong&gt; for clustering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERTopic&lt;/strong&gt; itself for topic extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each has its own parameters, and they interact in non-obvious ways. The documentation might tell you that &lt;code&gt;n_neighbors&lt;/code&gt; in &lt;strong&gt;UMAP&lt;/strong&gt; typically ranges from 5 to 50, but what works for your specific dataset? That's where systematic testing comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Data Pipeline
&lt;/h2&gt;

&lt;p&gt;First, I formalized a data format (detailed in the &lt;a href="https://gitlab.com/sartois/bertopic-hyperparameter-tuning/-/blob/main/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;). The idea is to automate everything: SQL export of collected data, manual clustering or &lt;a href="https://sylvain.artois.io/tech/using-claude-sonnet-for-nlp-clustering" rel="noopener noreferrer"&gt;clustering done by an LLM&lt;/a&gt;. You could easily imagine an n8n workflow that runs this script every two weeks or monthly to track semantic evolution in your data.&lt;/p&gt;

&lt;p&gt;The ground truth is a simple JSON with cluster IDs as keys and lists of text IDs as values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"7"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;288657&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;304729&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;314406&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"6"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;308809&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;309885&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;311726&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;312668&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;/*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...*/&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;//...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script expects a parameter grid to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"umap_n_neighbors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"umap_n_components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"umap_min_dist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hdbscan_min_cluster_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hdbscan_min_samples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hdbscan_cluster_selection_epsilon"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bertopic_top_n_words"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nr_topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"n_gram_max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"embedding_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"dangvantuan/sentence-camembert-large"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remove_stopwords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a refinement grid - I'd already run the script once on this dataset and was narrowing down the search space. Initial runs often have much wider parameter ranges.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Speed Optimization: Pre-Computing Embeddings
&lt;/h2&gt;

&lt;p&gt;Once the dataset is ready (examples are available in the repo), the first optimization is pre-calculating embeddings. While not computationally heavy, this step is crucial because it allows us to bypass GPU usage during the actual parameter search, enabling massive parallelization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;precalculate_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;texts_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;embedding_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;pooling_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Pre-calculate all embeddings for all combinations of models and stopword settings.
    This prevents multiple parallel processes from trying to compute embeddings simultaneously.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pre-calculating embeddings for all model/stopword combinations...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;embedding_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ensure_custom_embedding_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pooling_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pooling_mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;remove_stopwords&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;texts_for_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texts_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remove_stopwords&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;texts_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="nf"&gt;get_cached_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;texts_for_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;pooling_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;remove_stopwords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The embeddings are cached to disk using &lt;strong&gt;Python's pickle format&lt;/strong&gt; (&lt;code&gt;.pkl&lt;/code&gt; files), which allows for fast serialization and deserialization of numpy arrays. Each cache file is uniquely identified by a combination of the model name, pooling mode, stopword removal setting, and a hash of the corpus - ensuring we never accidentally use the wrong embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Execution with ProcessPoolExecutor
&lt;/h2&gt;

&lt;p&gt;The CLI entry point &lt;code&gt;main_tuner()&lt;/code&gt; in &lt;code&gt;bertopic_tuner/cli.py&lt;/code&gt; iterates through each possible combination using &lt;strong&gt;ProcessPoolExecutor&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Pre-calculate all embeddings before parallel processing
&lt;/span&gt;    &lt;span class="nf"&gt;precalculate_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unique_embedding_models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pooling_mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare all parameter combinations
&lt;/span&gt;    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_combinations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d_%H%M%S_%f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ParameterTuning_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manual_clusters_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;manual_clusters_path&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts_dict&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;max_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tasks with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; parallel processes...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProcessPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;future_to_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_berttopic_with_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;future_to_task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Process results...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each worker function &lt;code&gt;run_berttopic_with_params&lt;/code&gt; retrieves the cached embeddings and calls &lt;code&gt;modeler.run()&lt;/code&gt; directly as a Python function — no subprocess spawning, no temporary files.&lt;/p&gt;

&lt;p&gt;Why &lt;strong&gt;ProcessPoolExecutor&lt;/strong&gt;? It's the perfect tool for CPU-bound parallel tasks in Python. Since we've pre-computed embeddings, each &lt;strong&gt;BERTopic&lt;/strong&gt; run is purely CPU-intensive (&lt;strong&gt;UMAP&lt;/strong&gt; and &lt;strong&gt;HDBSCAN&lt;/strong&gt; calculations). The &lt;a href="https://wiki.python.org/moin/GlobalInterpreterLock" rel="noopener noreferrer"&gt;Global Interpreter Lock (GIL)&lt;/a&gt; doesn't affect separate processes, so we get true parallelism. With threads, we'd be limited by the GIL; with processes, we can use all available CPU cores effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  MLflow Integration for Experiment Tracking
&lt;/h2&gt;

&lt;p&gt;Inside &lt;code&gt;bertopic_tuner/modeler.py&lt;/code&gt;, we use &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; to track each experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_model_with_logging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclude_keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Setup a model with MLflow parameter logging.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exclude_keys&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exclude_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;mlflow_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;exclude_keys&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mlflow_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;model_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We build &lt;strong&gt;BERTopic&lt;/strong&gt; parameters based on the received configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;umap_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_neighbors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;umap_n_neighbors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_components&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;umap_n_components&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_dist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;umap_min_dist&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;random_state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is to run BERTopic without any &lt;a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html" rel="noopener noreferrer"&gt;representation models&lt;/a&gt;. While these are perfect in production for refining clusters, they're unnecessary here - we just want to measure how well a parameter combination matches our ground truth.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;topic_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BERTopic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;umap_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;umap_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hdbscan_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hdbscan_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectorizer_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectorizer_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_n_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bertopic_top_n_words&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nr_topics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nr_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_gram_range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_gram_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluation Metrics: Measuring Success
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;bertopic_tuner/evaluation.py&lt;/code&gt;, we calculate various metrics, but I primarily focus on two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A_f1_score_assignment&lt;/strong&gt;: Measures how well the predicted clusters match the ground truth assignments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A_ARI (Adjusted Rand Index)&lt;/strong&gt;: A chance-corrected measure of clustering similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a deeper dive into these metrics, check out &lt;a href="https://sylvain.artois.io/tech/nlp-supervised-clustering-evaluation-back-to-basics" rel="noopener noreferrer"&gt;my blog post on supervised clustering evaluation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing Results in MLflow
&lt;/h2&gt;

&lt;p&gt;Once experiments are complete, MLflow provides powerful tools for analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Coordinates Plot&lt;/strong&gt;: Visualize how different parameter combinations affect metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8kxp1jnra75vyfibe9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8kxp1jnra75vyfibe9r.png" alt="ML-Flow cluster card" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Representative Documents&lt;/strong&gt;: Review actual text samples from each cluster to assess quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdn39czkxpd69e27wtay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdn39czkxpd69e27wtay.png" alt="BERTopic representatives docs in MLFLow" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metric Comparisons&lt;/strong&gt;: Sort and filter experiments by any logged metric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx8t02fgi8dn0ils0twk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx8t02fgi8dn0ils0twk.png" alt="MLFlow experiments results sorted" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MLflow&lt;/strong&gt; UI (accessible at &lt;a href="http://localhost:12001" rel="noopener noreferrer"&gt;http://localhost:12001&lt;/a&gt; in the repo) becomes your command center for understanding which combinations work best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insights and Conclusions
&lt;/h2&gt;

&lt;p&gt;After building and using this system extensively, here are my main takeaways:&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed Through Smart Optimization
&lt;/h3&gt;

&lt;p&gt;By pre-computing embeddings and parallelizing experiments, I can test hundreds of parameter combinations in just tens of minutes. The embedding cache and ProcessPoolExecutor combination is the secret sauce here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovering Unexpected Configurations
&lt;/h3&gt;

&lt;p&gt;The system revealed parameter combinations I never would have tried manually. For instance, I now have a production configuration running with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"umap"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"n_neighbors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"n_components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No tutorial or LLM ever suggested parameters this high, but for my specific dataset, they work brilliantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Human Element Remains Critical
&lt;/h3&gt;

&lt;p&gt;While I'm automating the downstream pipeline (dataset generation, automated experimentation launches), I haven't automated the final configuration export - and I'm not sure it would be desirable. Often, the top-performing combinations are very close in metrics, and I need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze the results carefully&lt;/li&gt;
&lt;li&gt;Test them in production&lt;/li&gt;
&lt;li&gt;Make a final decision based on real-world performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best configuration on paper isn't always the best in practice. Sometimes a slightly lower-scoring configuration produces more interpretable or stable topics in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Forward
&lt;/h2&gt;

&lt;p&gt;This system has transformed how I approach NLP clustering. Instead of guessing parameters based on vague intuition or generic recommendations, I can systematically explore the parameter space and find what actually works for my data.&lt;/p&gt;

&lt;p&gt;The combination of BERTopic's flexibility, MLflow's tracking capabilities, and Python's parallel processing creates a powerful experimentation platform. Whether you're clustering news headlines, customer feedback, or any other text corpus, this approach can help you find the optimal configuration for your specific use case.&lt;/p&gt;

&lt;p&gt;Feel free to explore the &lt;a href="https://gitlab.com/sartois/bertopic-hyperparameter-tuning" rel="noopener noreferrer"&gt;source code&lt;/a&gt; and adapt it for your own projects.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
