<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kostas Pardalis</title>
    <description>The latest articles on Forem by Kostas Pardalis (@cpard).</description>
    <link>https://forem.com/cpard</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017586%2F28f2d97f-7a4a-493f-9064-066401c66633.jpeg</url>
      <title>Forem: Kostas Pardalis</title>
      <link>https://forem.com/cpard</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cpard"/>
    <language>en</language>
    <item>
      <title>How to Build a Deep Research Agent with Pydantic AI</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 17 Nov 2025 20:11:04 +0000</pubDate>
      <link>https://forem.com/cpard/how-to-build-a-deep-research-agent-with-pydantic-ai-3ogf</link>
      <guid>https://forem.com/cpard/how-to-build-a-deep-research-agent-with-pydantic-ai-3ogf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"HN is amazing for discovery, terrible for structured research."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you hang out on Hacker News, you know the feeling: you see a great thread, think &lt;em&gt;"I should come back to this"&lt;/em&gt;, and… never do. A week later, you're trying to answer a question like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"How has HN's opinion on Rust vs Go changed over time?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What does HN actually think about LangChain-style agent frameworks?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HN's built-in search is fine for keywords, but not for &lt;strong&gt;questions about themes, opinions, and trends&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What we really want is to ask higher-level questions about topics, threads, and time windows like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me discussions about &lt;code&gt;e.g. Rust&lt;/code&gt; in the last 6 months."&lt;/li&gt;
&lt;li&gt;"Compare how &lt;code&gt;remote work&lt;/code&gt; was discussed in 2021 vs 2024."&lt;/li&gt;
&lt;li&gt;"Summarize the main arguments for and against &lt;code&gt;LLM agents&lt;/code&gt; across top HN threads."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's where &lt;strong&gt;&lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;fenic&lt;/a&gt;&lt;/strong&gt; comes in: think of it as a &lt;strong&gt;dataframe + context layer&lt;/strong&gt; built for LLM-powered analysis. You declare what data you care about, use regular + semantic transforms to shape it, and then plug that into an agent loop.&lt;/p&gt;

&lt;p&gt;This post walks through how we use fenic to turn a raw Hacker News dataset into a small but powerful &lt;strong&gt;"deep research" agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Full project:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fenic repo: &lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Example code: &lt;a href="https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HN dataset: &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;2. What we'll build&lt;/h2&gt;

&lt;p&gt;We'll build a small research agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads an &lt;strong&gt;HN dataset&lt;/strong&gt; (stories, comments, metadata).&lt;/li&gt;
&lt;li&gt;Lets you &lt;strong&gt;filter and slice discussions&lt;/strong&gt; by topic, time, and signals.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;LLMs to summarize, compare, and extract themes&lt;/strong&gt; from those slices.&lt;/li&gt;
&lt;li&gt;Wraps it all in a &lt;strong&gt;simple loop&lt;/strong&gt;:
&lt;em&gt;user question → fenic dataframe query → LLM analysis → answer + links back to HN.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the pipeline looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data layer:&lt;/strong&gt; fenic DataFrames over the HN dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query layer:&lt;/strong&gt; reusable "research queries" expressed as fenic transformations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM layer:&lt;/strong&gt; fenic semantic operators (and/or UDFs) to summarize/compare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent loop:&lt;/strong&gt; something like PydanticAI or your framework of choice to orchestrate.&lt;/li&gt;
&lt;/ol&gt;
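&lt;p&gt;Put together, those four layers reduce to a small loop. Here's a minimal, framework-agnostic sketch; the &lt;code&gt;query_slice&lt;/code&gt; and &lt;code&gt;summarize_slice&lt;/code&gt; callables are placeholders standing in for the fenic query layer and the LLM layer, not the project's actual API:&lt;/p&gt;

```python
def run_research_loop(question, query_slice, summarize_slice):
    """user question -> dataframe query -> LLM analysis -> answer + links."""
    rows = query_slice(question)           # query layer: slice the HN data
    summary = summarize_slice(rows)        # LLM layer: summarize/compare the slice
    links = [row["url"] for row in rows]   # answer links back to the HN threads
    return {"answer": summary, "links": links}
```

&lt;p&gt;An agent framework like Pydantic AI replaces the hard-wired sequence with tool calls the model chooses itself, but the data flow is the same.&lt;/p&gt;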

&lt;h2&gt;3. Setting up fenic&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;An LLM provider key (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;uv&lt;/a&gt; package manager&lt;/li&gt;
&lt;li&gt;&lt;code&gt;git&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Install with uv&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the example repo&lt;/span&gt;
git clone https://github.com/typedef-ai/fenic-examples.git
&lt;span class="nb"&gt;cd &lt;/span&gt;fenic-examples/hn_agent

&lt;span class="c"&gt;# Install dependencies with uv&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your LLM API key(s) and HuggingFace token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this folder you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A notebook / script that wires together fenic + PydanticAI.&lt;/li&gt;
&lt;li&gt;Helper functions to load the Hacker News dataset.&lt;/li&gt;
&lt;li&gt;A simple agent loop that you can run locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you just want to &lt;strong&gt;run it and poke around&lt;/strong&gt;, start there. The rest of this post explains the pieces.&lt;/p&gt;

&lt;h2&gt;4. Loading the Hacker News data&lt;/h2&gt;

&lt;p&gt;The dataset we're using is published as a public Hugging Face Dataset:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stories&lt;/strong&gt;: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;by&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comments&lt;/strong&gt;: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;parent&lt;/code&gt;, &lt;code&gt;story_id&lt;/code&gt;, &lt;code&gt;by&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: type (&lt;code&gt;story&lt;/code&gt;, &lt;code&gt;comment&lt;/code&gt;), deleted flags, etc.&lt;/li&gt;
&lt;/ul&gt;
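&lt;p&gt;The &lt;code&gt;story_id&lt;/code&gt; field on comments is what makes slicing cheap: every comment can be grouped under its story directly, no tree traversal needed. A toy example with illustrative rows shaped like this schema (not real dataset records):&lt;/p&gt;

```python
# Illustrative rows shaped like the dataset's schema (not real records).
stories = [
    {"id": 1, "title": "Show HN: fenic", "by": "alice", "time": 1700000000, "score": 120},
]
comments = [
    {"id": 10, "parent": 1, "story_id": 1, "by": "bob", "time": 1700000100, "text": "Neat!"},
    {"id": 11, "parent": 10, "story_id": 1, "by": "carol", "time": 1700000200, "text": "Agreed."},
]

# Group comments under their story via story_id, without walking parents.
by_story = {}
for c in comments:
    by_story.setdefault(c["story_id"], []).append(c)
```

&lt;p&gt;Both comments end up under story 1, even though the second is a reply to a reply.&lt;/p&gt;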

&lt;p&gt;Here's how the actual data loader works in the hn_agent project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_session&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_hn_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load all Hacker News data from HuggingFace into local tables.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;base_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hf://datasets/typedef-ai/hacker-news-dataset/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# All 2025 data files to load
&lt;/span&gt;    &lt;span class="n"&gt;files_to_tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_polls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_pollopts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pollopts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_user_submissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_submissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_item_children&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_children&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_item_parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Load each file into its own table
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files_to_tables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_as_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; records into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session is configured with semantic support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.session.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SessionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SemanticConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAILanguageModel&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SemanticConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;language_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OpenAILanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-nano&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;rpm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;tpm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the data loading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HF_TOKEN&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.data.loader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads ~2.5M comments and ~500K stories from 2025 into a local fenic database. The loader also denormalizes data into optimized lookup tables (&lt;code&gt;comment_to_story&lt;/code&gt;, &lt;code&gt;story_threads&lt;/code&gt;, &lt;code&gt;story_discussions&lt;/code&gt;) that eliminate recursive SQL queries during tool execution.&lt;/p&gt;
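&lt;p&gt;To see why that denormalization matters: without a &lt;code&gt;comment_to_story&lt;/code&gt; table, finding a comment's story means walking the &lt;code&gt;parent&lt;/code&gt; chain recursively on every query. A hypothetical sketch (plain dicts, not the loader's actual code) of computing that mapping once, up front:&lt;/p&gt;

```python
# Hypothetical illustration: resolve each comment's story by walking the
# parent chain once up front, so later lookups are a single dict access.
items = {
    1:  {"type": "story",   "parent": None},
    10: {"type": "comment", "parent": 1},
    11: {"type": "comment", "parent": 10},  # reply to a reply
}

comment_to_story = {}
for item_id, item in items.items():
    if item["type"] != "comment":
        continue
    node = item_id
    while items[node]["type"] != "story":   # climb until we reach the story
        node = items[node]["parent"]
    comment_to_story[item_id] = node
```

&lt;p&gt;The agent's tools then join against this table instead of recursing at question time.&lt;/p&gt;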

&lt;h2&gt;5. Defining research queries&lt;/h2&gt;

&lt;p&gt;Instead of hard-coding one-off scripts, we treat &lt;strong&gt;"research questions"&lt;/strong&gt; as reusable dataframe transformations exposed as MCP tools.&lt;/p&gt;

&lt;p&gt;Here's how the search tool is registered in the actual project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fenic.api.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.core.types.datatypes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.core.mcp.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolParam&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_story_search_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Register a regex-based HN story search tool.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;

    &lt;span class="c1"&gt;# Get tables
&lt;/span&gt;    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;comment_to_story&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_to_story&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Denormalized table
&lt;/span&gt;
    &lt;span class="c1"&gt;# Tool parameters
&lt;/span&gt;    &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Story-side matches
&lt;/span&gt;    &lt;span class="n"&gt;title_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;url_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;story_text_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;story_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;story_text_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descendants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Comment-side matches - use denormalized lookup table (no recursion!)
&lt;/span&gt;    &lt;span class="n"&gt;comment_text_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;matched_comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;comments&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_text_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fast lookup using denormalized comment_to_story table
&lt;/span&gt;    &lt;span class="n"&gt;comment_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;matched_comments&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_to_story&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Combine and rank results
&lt;/span&gt;    &lt;span class="n"&gt;unified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;story_hits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_hits&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;sorted_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

    &lt;span class="c1"&gt;# Register the tool
&lt;/span&gt;    &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search Hacker News stories using regex patterns...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sorted_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;ToolParam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regular expression pattern to search for.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;result_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results are ranked by relevance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Title matches (most relevant)&lt;/li&gt;
&lt;li&gt;URL matches&lt;/li&gt;
&lt;li&gt;Story text matches&lt;/li&gt;
&lt;li&gt;Comment matches&lt;/li&gt;
&lt;/ol&gt;
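The tiering above is just a cascading `when`/`otherwise` over the match columns: the first condition that fires wins, and lower ranks sort first. A minimal pure-Python sketch of the same logic (the `match_rank` helper and sample rows are illustrative, not part of the project):

```python
def match_rank(title_match: bool, url_match: bool,
               text_match: bool, comment_match: bool) -> int:
    """Mirror the when/otherwise chain: title > URL > story text > comment."""
    if title_match:
        return 1
    if url_match:
        return 2
    if text_match:
        return 3
    if comment_match:
        return 4
    return 999  # no match; these rows are filtered out upstream

hits = [
    {"story_id": 1, "title_match": False, "url_match": False, "text_match": True,  "comment_match": False},
    {"story_id": 2, "title_match": True,  "url_match": False, "text_match": False, "comment_match": False},
    {"story_id": 3, "title_match": False, "url_match": True,  "text_match": False, "comment_match": False},
]

# Sort by tier, exactly as the dataframe sorts on match_rank.
ranked = sorted(hits, key=lambda h: match_rank(
    h["title_match"], h["url_match"], h["text_match"], h["comment_match"]))
print([h["story_id"] for h in ranked])  # → [2, 3, 1]
```

Within a tier, the real query then falls back to recency (`published_at` descending).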

&lt;h2&gt;
  
  
  6. Adding LLM-powered analysis
&lt;/h2&gt;

&lt;p&gt;fenic has first-class &lt;strong&gt;semantic operators&lt;/strong&gt; that wrap LLM calls as dataframe operations (with batching, retries, cost tracking, etc.). That lets you say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"For each group of comments, ask the model to summarize / classify / extract structure."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Summarize threads with structured output
&lt;/h3&gt;

&lt;p&gt;Here's how the &lt;code&gt;summarize_story&lt;/code&gt; tool works with Pydantic models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fenic.api.functions.semantic&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscussionTheme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Represents a theme or topic within a discussion.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name of the discussion theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Concise summary of the theme, viewpoints, and evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stance_spectrum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How opinions vary across this theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;representative_comment_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example comment IDs relevant to this theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;off_topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True if this theme is off-topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StorySummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured summary of a Hacker News story and its discussion.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tl_dr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two-sentence top summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;story_overview&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Short overview of the story itself&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key points and takeaways&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;discussion_themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DiscussionTheme&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Themes across the discussion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;variety_present&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Whether discussion splits into distinct topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;off_topic_themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Names of off-topic themes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;risks_or_concerns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Risks or concerns raised&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;actionables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Any concrete action items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Referenced comment IDs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;truncated_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True if input was truncated due to size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The summarization uses fenic's &lt;code&gt;semantic.map&lt;/code&gt; with a structured prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONCISE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Summarize this Hacker News discussion in {{ language }}:

Story: {{ title }} ({{ domain }})
URL: {{ url }}
Score: {{ score }}, Comments: {{ descendants }}
Published: {{ published_at }}

Discussion thread:
{{ transcript }}

Create a structured summary including:
1. TL;DR (max 2 sentences)
2. Story overview (brief)
3. Key points from discussion
4. Main discussion themes with viewpoints and stances
5. Off-topic themes if present
6. Risks/concerns raised
7. Action items mentioned

{{ extra_instructions }}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;summary_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CONCISE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;StorySummary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_alias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;descendants&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descendants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript_limited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;extra_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extra&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;with_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;with_discussion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other columns
&lt;/span&gt;    &lt;span class="n"&gt;summary_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now each row has a &lt;strong&gt;typed&lt;/strong&gt; &lt;code&gt;summary&lt;/code&gt; object you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access nested fields directly&lt;/li&gt;
&lt;li&gt;Aggregate stance ratios by year&lt;/li&gt;
&lt;li&gt;Join back to scores, authors, etc.&lt;/li&gt;
&lt;/ul&gt;
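Because each `summary` validates against `StorySummary`, downstream code works with typed attributes instead of scraping free text. A small sketch using trimmed-down versions of the models above (the sample response values are made up):

```python
from typing import List
from pydantic import BaseModel, Field


class DiscussionTheme(BaseModel):
    topic: str
    summary: str = ""
    off_topic: bool = False


class StorySummary(BaseModel):
    tl_dr: str
    discussion_themes: List[DiscussionTheme] = Field(default_factory=list)


# Validate a structured model response into a typed object.
raw = {
    "tl_dr": "HN debates agent frameworks.",
    "discussion_themes": [
        {"topic": "Developer experience", "summary": "APIs feel heavy."},
        {"topic": "Crypto tangent", "summary": "Off-topic drift.", "off_topic": True},
    ],
}
summary = StorySummary.model_validate(raw)

# Nested fields are plain attribute access; filtering and aggregation are
# ordinary Python (or dataframe ops when kept as a struct column).
on_topic = [t.topic for t in summary.discussion_themes if not t.off_topic]
print(summary.tl_dr)  # → HN debates agent frameworks.
print(on_topic)       # → ['Developer experience']
```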

&lt;h2&gt;
  
  
  7. Putting it together as an agent loop
&lt;/h2&gt;

&lt;p&gt;fenic takes care of &lt;strong&gt;data + context&lt;/strong&gt;. To make this interactive, we wrap it in a small agent loop using &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;Pydantic AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the actual research agent from the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPServerStreamableHTTP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured output for research findings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Methods used to research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key_findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Main discoveries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Common themes across stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;controversies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Points of disagreement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Story IDs and titles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;limitations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research limitations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a deep research agent analyzing Hacker News discussions via MCP tools.

Available tools:
- search_stories(pattern): Find stories matching a regex pattern
- summarize_story(story_id): Get AI summary of a story and its discussion
- read_story(story_id): Get full story with comment tree (use sparingly)

Research process:
1. Use search_stories to find relevant content (max 5 searches, limit 10 per search)
2. Use summarize_story on the most relevant stories
3. Only use read_story if you need specific metadata not in summaries
4. Synthesize findings across all stories

Important:
- Keep search patterns broad initially, then refine
- Always cite story IDs in your findings
- Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t paste raw tool outputs into context
- Focus on patterns and insights across multiple stories

Return a JSON object matching the DeepResearchReport schema.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_research_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_stories_to_summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run deep research on a Hacker News topic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HN_MCP_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create MCP connection
&lt;/span&gt;    &lt;span class="n"&gt;mcp_server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPServerStreamableHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create agent with structured output
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;toolsets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mcp_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build user prompt
&lt;/span&gt;    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Please investigate this topic across Hacker News stories and discussions.
Budget: max 5 searches, summarize up to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_stories_to_summarize&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; stories.
Focus on finding diverse perspectives and recurring themes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP server that exposes the tools is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_mcp_server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_mcp_server_sync&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_session&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.tools.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_tools&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Start the HTTP MCP server.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Register tools first with the same session
&lt;/span&gt;    &lt;span class="nf"&gt;register_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get all tools from catalog
&lt;/span&gt;    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Create the MCP server with tools
&lt;/span&gt;    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_mcp_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the server with HTTP transport
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting MCP server on http://localhost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;run_mcp_server_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Start MCP server&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.mcp.server

&lt;span class="c"&gt;# Terminal 2: Run research queries&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="s2"&gt;"What are concerns about AI safety?"&lt;/span&gt;

&lt;span class="c"&gt;# With options&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="nt"&gt;--max-stories&lt;/span&gt; 10 &lt;span class="s2"&gt;"Latest LLM developments"&lt;/span&gt;

&lt;span class="c"&gt;# Output as JSON&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="nt"&gt;--json&lt;/span&gt; &lt;span class="s2"&gt;"Rust vs Go discussions"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is the same:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dataframe slice → semantic transforms → agent consumes the results.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
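That loop can be sketched end-to-end with plain Python stand-ins, just to make the shape of the pattern concrete. Everything below (the story records, the keyword filter standing in for a dataframe slice, the truncation standing in for a semantic summary) is an illustrative assumption, not fenic or PydanticAI code:

```python
# Schematic of "dataframe slice, then semantic transform, then agent".
# All three functions are toy stand-ins for the real components.

def slice_stories(stories, keyword):
    """Stand-in for a dataframe filter: keep only matching stories."""
    return [s for s in stories if keyword in s["title"].lower()]

def summarize(story):
    """Stand-in for an LLM-backed semantic transform."""
    return {"id": story["id"], "summary": story["title"][:40]}

def agent_consume(summaries):
    """Stand-in for the agent step: synthesize over compact summaries."""
    return {"n_sources": len(summaries), "ids": [s["id"] for s in summaries]}

stories = [
    {"id": 1, "title": "Rust vs Go performance thread"},
    {"id": 2, "title": "Show HN: a new JS framework"},
    {"id": 3, "title": "Why we rewrote our service in Rust"},
]

report = agent_consume([summarize(s) for s in slice_stories(stories, "rust")])
print(report)
```

In the real project, the slice is a fenic dataframe operation, the summary is an LLM-backed transform exposed as an MCP tool, and the consumer is the PydanticAI agent.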

&lt;h2&gt;
  
  
  8. Where to go next
&lt;/h2&gt;

&lt;p&gt;Everything here is just one concrete instantiation of a more general pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Swap in your own datasets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company forum threads&lt;/li&gt;
&lt;li&gt;Support tickets&lt;/li&gt;
&lt;li&gt;Slack exports&lt;/li&gt;
&lt;li&gt;Internal RFCs / design docs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Reuse the same fenic primitives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter/slice on metadata (teams, product areas, time windows).&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;semantic.map&lt;/code&gt; with Pydantic models for structured extraction.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;semantic.extract&lt;/code&gt; for pulling typed data from text.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;fc.tool_param&lt;/code&gt; to create parameterized MCP tools.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Combine with other fenic examples&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;strong&gt;semantic join&lt;/strong&gt; examples to correlate HN threads with logs, incidents, or docs.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;clustering capabilities&lt;/strong&gt; to group similar discussions together.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;Hugging Face Datasets integration&lt;/strong&gt; to hydrate other versioned datasets into DataFrames with one line.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
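All of these primitives share one shape: declare a typed schema, then have the engine fill it from unstructured text. Here is a dependency-free sketch of that shape, using a stdlib dataclass and a trivial rule-based extractor standing in for an LLM-backed `semantic.extract` call; the schema fields and extraction rules are illustrative assumptions, not fenic API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class TicketFacts:
    # Illustrative schema; with fenic you would declare this as a Pydantic model.
    product_area: str = ""
    severity: str = ""
    mentions_ids: list = field(default_factory=list)

def extract_facts(text):
    """Rule-based stand-in for an LLM-backed extraction call."""
    severity = "high" if "outage" in text.lower() else "normal"
    area_match = re.search(r"\[(\w+)\]", text)
    return TicketFacts(
        product_area=area_match.group(1) if area_match else "unknown",
        severity=severity,
        mentions_ids=[int(n) for n in re.findall(r"#(\d+)", text)],
    )

facts = extract_facts("[billing] Outage after deploy, see #4121 and #4130")
print(facts)
```

The point of the semantic version is that the "rules" become a model prompt, so the same typed output works on text no regex could anticipate.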

&lt;p&gt;If you want to dig deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fenic core:&lt;/strong&gt; &lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This HN agent example:&lt;/strong&gt; &lt;a href="https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News dataset on HF:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love to hear how you adapt this pattern, whether it's for HN, your company's internal knowledge, or other messy discussion data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>DX UX = U DX UX =</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 05 Sep 2024 20:40:44 +0000</pubDate>
      <link>https://forem.com/cpard/dx-ux-u-dx-ux--e94</link>
      <guid>https://forem.com/cpard/dx-ux-u-dx-ux--e94</guid>
      <description>&lt;p&gt;User Experience is primarily concerned about guiding the user to a desired outcome in the most optimal way, optimizing for time and margin of error. That’s why the term journey is heavily used within the context of UX design.&lt;/p&gt;

&lt;p&gt;Developer Experience on the other hand, is not about guiding the user, but designing the right abstractions and choosing what part of the system complexity to expose to the developer.&lt;/p&gt;

&lt;p&gt;I tend to think of UX as building guardrails and DX as crafting a toolkit. While UX aims to create a smooth, intuitive path for users, DX focuses on providing developers with powerful, flexible tools that allow them to build efficiently and effectively.&lt;/p&gt;

&lt;p&gt;In UX, we’re often trying to anticipate user needs and minimize cognitive load. We create interfaces that are self-explanatory and workflows that feel natural. The goal is to make the product as easy to use as possible, even for first-time users.&lt;/p&gt;

&lt;p&gt;DX, however, is about empowering developers. It’s about creating APIs, frameworks, and development environments that are not necessarily simple on the surface, but are logically structured and well-documented. Good DX allows developers to harness complex functionality without getting bogged down in unnecessary details.&lt;/p&gt;

&lt;p&gt;Both UX and DX share a common goal of efficiency, but they approach it from different angles. UX seeks to reduce friction for end-users, while DX aims to maximize productivity for developers. In the best scenarios, great DX leads to better UX, as developers are able to create more robust, performant, and feature-rich applications.&lt;/p&gt;

&lt;p&gt;In DX, the tools we craft must feel native to the way each type of engineer understands the world. It’s crucial to recognize that different engineering disciplines often operate with distinct mental models and vocabularies, even when working with seemingly similar concepts. For instance, a data engineer and an application engineer may not share the same semantic understanding, despite potentially using identical syntax.&lt;/p&gt;

&lt;p&gt;Consider the term “partition” as an example. A data engineer might conceptualize partitions in terms of distributing large datasets across multiple storage units for efficient processing and querying. In contrast, an application engineer working with Kafka might think of partitions as a way to organize and parallelize message streams.&lt;/p&gt;

&lt;p&gt;While the word is the same, the underlying concepts and implications differ significantly based on the engineer’s domain of expertise.&lt;/p&gt;

&lt;p&gt;Therefore, when designing tools and abstractions for DX, we must tailor them to align with the specific mental models and workflow patterns of each engineering discipline. This approach ensures that our tools not only provide functionality but also resonate with the intuitive understanding and problem-solving approaches of the engineers using them.&lt;/p&gt;

&lt;p&gt;Failing to do a good job with this alignment leads to the inefficiency that plagues much of today’s compute infrastructure, the frustration practitioners feel, and a steep learning curve that ends up being the reason we aren’t seeing the growth in new practitioners entering the market that we could otherwise expect.&lt;/p&gt;

&lt;p&gt;The tools considered today as the foundation of the upcoming AI and data revolution were built in a different era and designed for a vastly different user base; they cannot address the diverse range of practitioners and use cases we have today.&lt;/p&gt;

&lt;p&gt;If we want to realize the potential of AI and data, we must fundamentally rethink the tools we have today, designing them with the needs and skills of current and future users in mind.&lt;/p&gt;

&lt;p&gt;Failing to do this will risk the success of these emerging technologies.&lt;/p&gt;

</description>
      <category>dx</category>
      <category>ux</category>
    </item>
    <item>
      <title>Why you should keep an eye on Apache DataFusion and its community.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Tue, 09 Jul 2024 05:32:43 +0000</pubDate>
      <link>https://forem.com/cpard/why-you-should-keep-an-eye-on-apache-datafusion-and-its-community-4e6g</link>
      <guid>https://forem.com/cpard/why-you-should-keep-an-eye-on-apache-datafusion-and-its-community-4e6g</guid>
      <description>&lt;p&gt;On June 24, 2025, the first San Francisco Bay Area DataFusion meetup happened. I had the opportunity to help with the organization of the event and also attend.&lt;/p&gt;

&lt;p&gt;The event had a lot of content from six different companies. These companies ranged from startups to scale-ups and big Fortune 500 companies. Leaving the event, I felt I had experienced something significant, and I want to share it with you.&lt;/p&gt;

&lt;p&gt;And trust me, you don't want to miss out on this!&lt;/p&gt;

&lt;h2&gt;
  
  
  What are you talking about, dude?
&lt;/h2&gt;

&lt;p&gt;In case you don't know what &lt;a href="https://github.com/apache/datafusion" rel="noopener noreferrer"&gt;Apache DataFusion&lt;/a&gt; is, here's the high-level blurb.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in &lt;a href="http://rustlang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a pretty good description of what DataFusion technically is, but like many amazing open source projects, it sells itself short.&lt;/p&gt;

&lt;p&gt;Here are a few reasons why I say that; by the end of this list, you'll see why you should pay close attention to the future of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, the technology
&lt;/h2&gt;

&lt;p&gt;Databases are notoriously hard to build and get to market. So hard that there's a whole graveyard of database systems that were built and never made it into a product.&lt;/p&gt;

&lt;p&gt;The reason for that is simple. Databases are just very complex systems.&lt;/p&gt;

&lt;p&gt;They stand up there together with operating systems and compilers in terms of technical complexity.&lt;/p&gt;

&lt;p&gt;Operating systems tamed this complexity with the genius of Linux: there's the kernel, and then a whole set of layers that build functionality in both user-land and kernel-land.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href="https://llvm.org/" rel="noopener noreferrer"&gt;LLVM&lt;/a&gt; revolutionized the world of programming languages and compilers. Since its creation, we've seen many new languages of increasing complexity being created.&lt;/p&gt;

&lt;p&gt;But databases are still waiting for their LLVM moment. Until today, if you wanted to build a database system, you pretty much had to build every piece of it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design the grammar of the query language&lt;/li&gt;
&lt;li&gt;Build a parser&lt;/li&gt;
&lt;li&gt;Figure out an intermediate representation&lt;/li&gt;
&lt;li&gt;Logical plans&lt;/li&gt;
&lt;li&gt;Optimizations of logical plans&lt;/li&gt;
&lt;li&gt;Query optimizers&lt;/li&gt;
&lt;li&gt;Physical plans&lt;/li&gt;
&lt;li&gt;Execution engines&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;/ul&gt;
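To make the list above concrete, here is a deliberately tiny, pure-Python caricature of those stages for a two-operation query language. It is a teaching sketch of the pipeline shape, not DataFusion code:

```python
# Toy end-to-end query pipeline mirroring the stages above:
# parse, then logical plan, then optimize, then execute.

def parse(query):
    """Parser: queries look like 'SUM a' or 'COUNT a' over a named column."""
    op, column = query.split()
    return {"op": op, "column": column}           # intermediate representation

def logical_plan(ir, table):
    """Bind the IR to actual data."""
    return {"op": ir["op"], "values": table[ir["column"]]}

def optimize(plan):
    """Optimizer: COUNT does not need the values, only how many there are."""
    if plan["op"] == "COUNT":
        return {"op": "COUNT", "values": len(plan["values"])}
    return plan

def execute(plan):
    """Execution engine: evaluate the (possibly optimized) plan."""
    if plan["op"] == "SUM":
        return sum(plan["values"])
    return plan["values"]                         # already reduced by optimizer

table = {"a": [1, 2, 3, 4]}
print(execute(optimize(logical_plan(parse("SUM a"), table))))    # 10
print(execute(optimize(logical_plan(parse("COUNT a"), table))))  # 4
```

DataFusion's value is that each of these stages is a production-grade, swappable component, so a team only has to rewrite the stage they want to innovate on.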

&lt;p&gt;And all that while fighting constantly with performance and correctness.&lt;/p&gt;

&lt;p&gt;It can be done, but it takes a lot of time, and in the world of technology, time is the only resource you don't really have.&lt;/p&gt;

&lt;p&gt;As a result, most companies that tried to market a new database didn't have enough time to figure out what the market needed.&lt;/p&gt;

&lt;p&gt;DataFusion is changing this.&lt;/p&gt;

&lt;p&gt;Its design lets a team focus on a specific part of the database system they want to change. They can then reuse the rest, which greatly reduces the time it takes to get the product to the market.&lt;/p&gt;

&lt;p&gt;A look at the companies using DataFusion today is a testament to that claim.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt;, &lt;a href="http://cube.dev/" rel="noopener noreferrer"&gt;Cube.dev&lt;/a&gt;, &lt;a href="https://www.influxdata.com/" rel="noopener noreferrer"&gt;InfluxData&lt;/a&gt;, &lt;a href="https://www.denormalized.io/" rel="noopener noreferrer"&gt;Denormalized&lt;/a&gt;, and &lt;a href="https://greptime.com/" rel="noopener noreferrer"&gt;Greptime&lt;/a&gt; are building completely different products. What they have in common, though, is that their products are a database system at their core and they are also using DataFusion to build them.&lt;/p&gt;

&lt;p&gt;Each project is innovating on a different part of the database system and reusing the rest, which DataFusion provides out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The community
&lt;/h2&gt;

&lt;p&gt;DataFusion is a young open-source project, but it has managed to build a very healthy community.&lt;/p&gt;

&lt;p&gt;That was evident at the meetup event, where everyone was there to share knowledge and seek opportunities to contribute back.&lt;/p&gt;

&lt;p&gt;Building such a community is not easy, and it's primarily the result of the hard work of a very small number of people. &lt;a href="https://x.com/andrewlamb1111" rel="noopener noreferrer"&gt;Andrew Lamb&lt;/a&gt; and &lt;a href="https://x.com/andygrove_io" rel="noopener noreferrer"&gt;Andy Grove&lt;/a&gt; have done an amazing job so far, and they deserve recognition for that.&lt;/p&gt;

&lt;p&gt;Toxicity and bad governance are what kill many open-source projects, but what I've experienced from the community so far makes me feel very optimistic about the future.&lt;/p&gt;

&lt;p&gt;Having said that, the work of these folks shouldn't be taken for granted. Everyone who benefits from the project and the community should try to support it in whatever way they can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance &amp;amp; ownership
&lt;/h2&gt;

&lt;p&gt;DataFusion is blessed to be an open-source project that doesn't have a single company maintaining it.&lt;/p&gt;

&lt;p&gt;The open-core model of monetizing software has left a very bitter taste in the mouths of many practitioners; Hashicorp and Databricks are just two examples of that.&lt;/p&gt;

&lt;p&gt;We need a different model for building monetary value on top of open source. Projects like Apache Arrow and Apache DataFusion are a great example of what a better future could look like.&lt;/p&gt;

&lt;p&gt;All the companies I mentioned in the previous section benefit from DataFusion and contribute back to the project. They also monetize their technology and build their moats, without being antagonistic to the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stars are aligned
&lt;/h2&gt;

&lt;p&gt;Finally, the market is looking for solutions to problems that will require a lot of innovation to happen in data management systems.&lt;/p&gt;

&lt;p&gt;The rise of new use cases like AI and ML is pushing existing solutions to their limits.&lt;/p&gt;

&lt;p&gt;We need to build and we don't have the luxury of iterating over 5+ years to just get a demo out there to the market.&lt;/p&gt;

&lt;p&gt;DataFusion and the rest of the Arrow ecosystem are the foundation that will enable that, and it's already happening.&lt;/p&gt;

&lt;p&gt;The companies that presented at the Bay Area meetup collectively received over $200 million in funding.&lt;/p&gt;

&lt;p&gt;All are using DataFusion for critical parts of their products and contribute back to the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;The above are just a few of the reasons that make DataFusion such a special project. It's still early, but the future looks really bright.&lt;/p&gt;

&lt;p&gt;I hope I convinced you to keep an eye on the project, and if not, reach out and let me know why. I'm happy to hear your thoughts.&lt;/p&gt;

&lt;p&gt;I'll leave you for now with a prediction: 1,000 projects built on DataFusion is not that far away!&lt;/p&gt;

</description>
      <category>datafusion</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A glimpse into the future of data processing infrastructure.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 02 May 2024 18:49:22 +0000</pubDate>
      <link>https://forem.com/cpard/a-glimpse-into-the-future-of-data-processing-infrastructure-45ne</link>
      <guid>https://forem.com/cpard/a-glimpse-into-the-future-of-data-processing-infrastructure-45ne</guid>
      <description>&lt;p&gt;Three weeks ago, VeloxCon took place in San Jose. The event was a great opportunity for people who are interested in execution engines and data processing at scale to learn about the current state of the project.&lt;/p&gt;

&lt;p&gt;Most importantly, though, it was an amazing opportunity to get a glimpse of what the future of data processing will be like. From what we saw at the event, this future is very exciting!&lt;/p&gt;

&lt;p&gt;Let's get into more details of what happened at the event and why it's important.&lt;/p&gt;

&lt;h1&gt;
  
  
  First, what is Velox?
&lt;/h1&gt;

&lt;p&gt;Velox is an open-source unified execution engine created and open-sourced by Meta, aiming to commoditize execution in data management systems.&lt;/p&gt;

&lt;p&gt;You can think of the execution engine as the component of a data management system that is responsible for processing the data.&lt;/p&gt;

&lt;p&gt;This part is usually one of the most time-consuming to build and also the most demanding in terms of correctness. We might assume that whenever we run a &lt;em&gt;SUM&lt;/em&gt; aggregation function, the result will always be correct. This is only possible because dedicated engineers invest countless hours guaranteeing that your database functions correctly.&lt;/p&gt;

&lt;p&gt;Why could Velox commoditize execution in data management systems? Because execution is such a hard problem to solve. If we had a library that always does the right thing, and does it in the best way possible, we could build the other services around it.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because if we can succeed in that, then systems and database engineers can design modular data management systems and iterate faster. As a result, users of these systems can enjoy better products and more innovation reaching them faster than ever before.&lt;/p&gt;

&lt;h1&gt;
  
  
  What happened at VeloxCon this year
&lt;/h1&gt;

&lt;p&gt;The conference ran over two days and was packed with interesting and, as expected, very technical talks. The high-level structure of the event was the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Day 1 was all about updates on the current state of the project

&lt;ul&gt;
&lt;li&gt;New features that have been delivered to Velox&lt;/li&gt;
&lt;li&gt;Updates on who is currently using it in production and how&lt;/li&gt;
&lt;li&gt;Open Source Project management and governance updates&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Day 2 was all about the future of Velox and data management systems

&lt;ul&gt;
&lt;li&gt;A shift into use cases and workloads for data management systems&lt;/li&gt;
&lt;li&gt;A lot of hardware, which might sound surprising for such a conference, however, it's one of the most interesting signals for the future of data management&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Every single presentation at VeloxCon deserves its own post. Instead, I'm going to share the takeaways that I believe are the most important. I'd suggest you then go to the &lt;a href="https://www.youtube.com/playlist?list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; of the conference and watch all of the presentations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #1: For analytical workloads, it's all about performance optimization
&lt;/h1&gt;

&lt;p&gt;Today, we know how to manage these workloads. We also need to make them accessible to everyone. The market demands their commoditization. For analytical workloads at any scale, it's all about optimization.&lt;/p&gt;

&lt;p&gt;Optimization comes in two forms. I would argue that these two forms are just different sides of the same coin; one is performance optimization and the other is cost optimization.&lt;/p&gt;

&lt;p&gt;The good news is that there's huge opportunity for delivering value here.&lt;/p&gt;

&lt;p&gt;My take is that in the coming years we will see more performance and auto-tuning coming from the systems themselves. As a result, practitioners will spend more time building than operating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=2O6_08A-vLo&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=7" rel="noopener noreferrer"&gt;What's new in Velox - Jimmy Lu, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=qO6uUdz7GNY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=19" rel="noopener noreferrer"&gt;Velox I/O optimizations - Deepak Majeti, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #2: Defining and measuring performance is a very hard problem
&lt;/h1&gt;

&lt;p&gt;This is not just an engineering problem. It's a market problem, too.&lt;/p&gt;

&lt;p&gt;TPC-H and TPC-DS are useful for understanding the operators a data management system supports. However, they are not enough to decide whether a system is the best fit for your workloads.&lt;/p&gt;

&lt;p&gt;Just take a look at the &lt;a href="https://www.youtube.com/watch?v=2O6_08A-vLo" rel="noopener noreferrer"&gt;presentation about what's new in Velox&lt;/a&gt; to see the optimizations that are being discussed and the improvements they are talking about.&lt;/p&gt;

&lt;p&gt;The space for possible optimizations for data processing is just enormous. So how do you choose where to focus and what to go after?&lt;/p&gt;

&lt;p&gt;My take is that although TPC-* benchmarks are important tools, there's a lot more to be done on this front. Unfortunately, bench-marketing has been a plague in this industry and it's quite easy for benchmarking suites to turn into marketing tools without much substance.&lt;/p&gt;

&lt;p&gt;Because of that, a different approach to benchmarking is important. It should be more of defining frameworks and tooling to create custom benchmarks that fit the workloads of each user.&lt;/p&gt;

&lt;p&gt;Soon, the question when evaluating a data processing system will not be whether it's complete and accurate. Systems like Velox will make sure of that. Instead, we'll focus on how well a particular system performs on benchmarks that come from the user's actual workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #3: Databricks' moats are holding strong
&lt;/h1&gt;

&lt;p&gt;When I first learned about the &lt;a href="https://github.com/apache/incubator-gluten" rel="noopener noreferrer"&gt;Gluten project&lt;/a&gt; from Intel, I thought Databricks was going to be in trouble.&lt;/p&gt;

&lt;p&gt;Photon was a great advantage for them when they split from Apache Spark. If Apache Spark now gets a similar execution engine, it'll be harder for people to switch to Databricks.&lt;/p&gt;

&lt;p&gt;Especially if something like EMR Spark gets an execution engine comparable to Photon.&lt;/p&gt;

&lt;p&gt;We are still far from widespread production use of Gluten with Velox, though. Although the project is gaining momentum, and we have seen initial production deployments, there are still significant gaps that need to be addressed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initially, there were missing Spark functions in Velox. However, the gap is narrowing quickly.&lt;/li&gt;
&lt;li&gt;Second, implementing a function is one thing. Guaranteeing that the implementation is semantically and performance-wise equivalent to Spark's is another. I'll get back to this later because there are some very interesting insights from the event regarding workload migrations to new engines.&lt;/li&gt;
&lt;li&gt;Third, PySpark and dataframe support is not there yet, and that's a big issue, as these two APIs have become an important driver of Spark adoption.&lt;/li&gt;
&lt;li&gt;Finally, UDFs for Spark need to be figured out. A lot of business logic is done as User-Defined Functions on Spark. Moving these functions to a new system while making sure they still work and perform better isn't easy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are getting closer to having a Photon-like execution engine on Apache Spark; however, it will take more time before we have something that can threaten Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=7pXOAjSITYs&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=10" rel="noopener noreferrer"&gt;Accelerating Spark at Microsoft using Gluten &amp;amp; Velox - Swinky Mann and Zhen Li, Microsoft&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=pQ4bMyXXLss&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=9" rel="noopener noreferrer"&gt;Unlocking Data Query Performance at Pinterest - Zaheen Aziz&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #4: Databricks might not be in trouble, but maybe Snowflake is?
&lt;/h1&gt;

&lt;p&gt;One of the interesting things I learned at the event was that Presto's integration of Velox is much more mature than that of Spark.&lt;/p&gt;

&lt;p&gt;Although I initially expected Databricks to be the primary focus, it turned out to be different. The implementation of Velox in Spark is not progressing as quickly as it is in Presto.&lt;/p&gt;

&lt;p&gt;Currently, Prestissimo has been fast replacing good old Presto inside Meta, and it has reached a great level of maturity.&lt;/p&gt;

&lt;p&gt;Traditionally, Presto is employed for interactive analysis on a data lake. It operates similarly to a data warehouse such as Snowflake, except that it runs on more flexible infrastructure and also offers query federation capabilities.&lt;/p&gt;

&lt;p&gt;At the same time, the performance improvements that are being reported for Prestissimo are quite amazing. Being more performant means more flexibility for trading off cost over latency while enabling new workloads that were not possible before. Previously, it was uncommon to perform ETL on Presto, but now it is becoming increasingly common.&lt;/p&gt;

&lt;p&gt;If Presto performs as well as Snowflake while having a more open design, what will stop people from using it instead of Snowflake?&lt;/p&gt;

&lt;p&gt;I would argue here that the main obstacle for this to happen is the operational burden on Presto and Trino. Setting up and running these systems is often a harder task than doing the same for Spark.&lt;/p&gt;

&lt;p&gt;Migrating away from Snowflake might not be a difficult decision. Systems like Athena and Starburst Galaxy are improving their developer experience, and the performance of systems like Velox is on par with Snowflake.&lt;/p&gt;

&lt;p&gt;This makes me think that data warehousing will become a commodity much quicker than Spark and its workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=BmxqcpYeviQ&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=6" rel="noopener noreferrer"&gt;Parquet and Iceberg 2.0 Support - Ying Su, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=hx4pGdb3i04&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=5" rel="noopener noreferrer"&gt;Prestissimo at IBM - Aditi Pandit, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #5: We need more developer tooling around data infrastructure
&lt;/h1&gt;

&lt;p&gt;In the past, data tools were built for data analysts, not data engineers. Data quality platforms were designed with attractive user interfaces, while catalogs were created to rival the user experience of top-notch SaaS applications.&lt;/p&gt;

&lt;p&gt;However, the future of data infrastructure is going to be a little bit different.&lt;/p&gt;

&lt;p&gt;Like in app development, the key to making data teams more productive lies not in UX, but in DX.&lt;/p&gt;

&lt;p&gt;There's a great need for tooling for developers who are responsible for building and maintaining data platforms. One great example of that is the lack of good fuzzers for testing data platforms.&lt;/p&gt;

&lt;p&gt;Jimmy Lu explained how Velox uses fuzzers to make sure Prestissimo-Velox performs as expected, compared to Presto's standard execution engine.&lt;/p&gt;
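&lt;p&gt;The idea can be illustrated with a tiny differential-testing sketch in Python (my own toy illustration, not Velox's actual fuzzer): feed the same random inputs to two implementations of &lt;em&gt;SUM&lt;/em&gt; and flag any disagreement.&lt;/p&gt;

```python
import math
import random

def naive_sum(values):
    # "Old engine": straightforward left-to-right float addition.
    total = 0.0
    for v in values:
        total += v
    return total

def reference_sum(values):
    # "New engine": math.fsum tracks partial sums for exact rounding.
    return math.fsum(values)

def fuzz_sum(rounds=1000, seed=42, tolerance=1e-9):
    """Differential fuzzing: run both SUMs on random inputs, collect mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(rounds):
        values = [rng.uniform(-1e6, 1e6) for _ in range(rng.randint(0, 100))]
        a, b = naive_sum(values), reference_sum(values)
        if not math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance):
            mismatches.append((values, a, b))
    return mismatches
```

&lt;p&gt;A real engine fuzzer also randomizes data types, null patterns, and whole query plans, but the core loop is the same.&lt;/p&gt;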

&lt;p&gt;This type of tooling is important for any team that is trying to do a migration. Not just from one vendor to another, but even between different versions of the same piece of infrastructure.&lt;/p&gt;

&lt;p&gt;Just ask anyone involved in migrating from Hive to anything else how long it took to properly migrate away from it with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks to check
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=pQ4bMyXXLss&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=9" rel="noopener noreferrer"&gt;Unlocking Data Query Performance at Pinterest - Zaheen Aziz&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #6: There's a tectonic shift in the importance of data workloads
&lt;/h1&gt;

&lt;p&gt;Nimble's presentation has over 7K views, while others have just over 1K.&lt;/p&gt;

&lt;p&gt;Parquet has been the foundation of large-scale data processing for over 10 years. Now, people are starting to build something to replace it. This is significant for the industry.&lt;/p&gt;

&lt;p&gt;ML workloads are becoming more important and more mainstream than they used to be. That's the shift we are talking about here.&lt;/p&gt;

&lt;p&gt;Analytical workloads will keep growing, but the market demands that ML workloads run more efficiently and reach more people.&lt;/p&gt;

&lt;p&gt;To connect with what we said earlier about analytical workloads: the market is looking for infrastructure that can deliver efficiencies for them. At the same time, it is looking for technologies that can enable ML workloads at scale, which are new types of workloads and use cases.&lt;/p&gt;

&lt;p&gt;An interesting observation, though, is that some fundamental parts of data infrastructure do not have to change to serve both workloads. Velox is a great example of this!&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=ACyhL9rdv-s&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=13" rel="noopener noreferrer"&gt;Velox Wave and Accelerators - Orri Erling, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=CXxDNWrdEyk&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=16" rel="noopener noreferrer"&gt;Theseus: A composable, distributed, hardware-agnostic processing engine - Voltron Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=bISBNVtXZ6M&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=17" rel="noopener noreferrer"&gt;Nimble, A New Columnar File Format - Yoav Helfman, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #7: Hardware is still the catalyst for innovation in data processing
&lt;/h1&gt;

&lt;p&gt;Hardware has always been a catalyst for data processing technologies. (If you don't believe me, just listen to &lt;a href="https://datastackshow.com/podcast/system-evolution-from-hadoop-to-rocksdb-with-dhruba-borthakur-of-rockset/" rel="noopener noreferrer"&gt;Dhruba Borthakur&lt;/a&gt;, the creator of RocksDB).&lt;/p&gt;

&lt;p&gt;Over time, database systems have changed a lot. First, there was Hadoop, which used cheap hard disk drives (HDDs). Then, low-latency systems like RocksDB came along because of cheap solid-state drives (SSDs). Now, we have cloud warehouses, which are possible because of the massive and cheap block storage on the cloud.&lt;/p&gt;

&lt;p&gt;The main difference between the above and what is happening today, though, is that today the main driver of innovation is not storage but processing. GPUs, TPUs, FPGAs, any sort of on-chip accelerator is what is going to drive the next wave of innovation in data management systems.&lt;/p&gt;

&lt;p&gt;VeloxCon's second day focused on hardware accelerators. Talks covered what hardware vendors are bringing and what query engines need to support this new hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=ACyhL9rdv-s&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=13" rel="noopener noreferrer"&gt;Velox Wave and Accelerators - Orri Erling, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=NwGjRdFqghI&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=15" rel="noopener noreferrer"&gt;Velox, Offloading work to accelerators - Sergei Lewis, Rivos&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=ghhiBE23kqg&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=14" rel="noopener noreferrer"&gt;NeuroBlade's SPU Accelerates Velox by 10x - Krishna Maheshwari, Neuroblade&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Outro
&lt;/h1&gt;

&lt;p&gt;VeloxCon gave me a glimpse of a future that is fast approaching.&lt;/p&gt;

&lt;p&gt;There are many challenges, but with them also many amazing opportunities for building new technologies and delivering immense amounts of value.&lt;/p&gt;

&lt;p&gt;I personally cannot wait to see what will happen in the next couple of months with Velox and the industry. Exciting times ahead!&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>snowflake</category>
      <category>spark</category>
    </item>
    <item>
      <title>WTF is a Vector Database?</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Sun, 23 Apr 2023 04:57:53 +0000</pubDate>
      <link>https://forem.com/cpard/wtf-is-a-vector-database-3l01</link>
      <guid>https://forem.com/cpard/wtf-is-a-vector-database-3l01</guid>
      <description>&lt;p&gt;It’s obviously a database, right? 😄 but how is it different from whatever you’ve heard until now that is a database? Like MySQL or PostgreSQL? &lt;/p&gt;

&lt;p&gt;Let’s start by going through the basics and trust me, by the end of this you will have a much better understanding of WTF is a Vector Database!&lt;/p&gt;

&lt;h1&gt;
  
  
  WTF is a Vector Database?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  It’s a database
&lt;/h2&gt;

&lt;p&gt;I’ll commit some plagiarism here, but it’s better to hear what a database is from someone who knows it much better than I do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_7.02.46_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_7.02.46_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;Vector databases do organize inter-related data that models some aspect of the real world! They are not a core component of most computer applications yet, but if the AI revolution lives up to its current hype, they might become one.&lt;/p&gt;

&lt;p&gt;Databases usually come packaged as Database Management Systems (DBMS). You have probably heard this term already, and it’s important to keep in mind the difference between a database and a DBMS.&lt;/p&gt;

&lt;p&gt;A set of CSV files in your file system can definitely be a database. It absolutely follows the above definition. It can contain information that is inter-related and that models some aspect of the real-world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database to DBMS
&lt;/h2&gt;

&lt;p&gt;But what turns a database into a DBMS is what makes databases hard in general. A DBMS includes functionality for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensuring Data Integrity&lt;/li&gt;
&lt;li&gt;Data manipulation and access, i.e. add new data&lt;/li&gt;
&lt;li&gt;Durability, i.e. what if the database crashes?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If in the above functionality we also add APIs for generic software to interact with the database for storing and processing data, then we have the definition of what a DBMS is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Models &amp;amp; Databases
&lt;/h2&gt;

&lt;p&gt;Let’s see what CMU-DB and Prof. Pavlo have to say about data models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.52.00_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.52.00_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;And most importantly let’s see some examples of Data models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.53.40_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.53.40_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;The course is about Relational databases but you might have noticed that there’s a mention to vectors in there! &lt;/p&gt;

&lt;p&gt;This is important here because it gives us the first concrete definition of what a vector database is. &lt;/p&gt;

&lt;p&gt;💡 A vector database is a DBMS that supports a Vector Data Model, in other words it’s a DBMS that uses vectors for describing the data in a database.&lt;/p&gt;

&lt;p&gt;As we will see, it’s pretty easy to add support for vectors to most of the relational databases that exist today. What makes vector databases a different breed is their native support for specific operations around vectors that are important for Machine Learning and AI. &lt;/p&gt;
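&lt;p&gt;As a rough sketch of how vector support can be bolted onto an existing relational database (my own toy illustration using SQLite; real extensions such as pgvector are far more efficient), we can store vectors as JSON arrays and register cosine similarity as a user-defined function:&lt;/p&gt;

```python
import json
import math
import sqlite3

def cosine_similarity(a_json, b_json):
    """Cosine similarity between two vectors stored as JSON array text."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it as cosine_sim().
conn.create_function("cosine_sim", 2, cosine_similarity)
conn.execute("CREATE TABLE docs (name TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("cat", json.dumps([1.0, 0.9, 0.0])),
     ("dog", json.dumps([0.9, 1.0, 0.1])),
     ("car", json.dumps([0.0, 0.1, 1.0]))],
)
# Rank rows by similarity to a query vector; note this is a full table scan.
query = json.dumps([1.0, 1.0, 0.0])
rows = conn.execute(
    "SELECT name, cosine_sim(embedding, ?) AS sim FROM docs ORDER BY sim DESC",
    (query,),
).fetchall()
```

&lt;p&gt;What a dedicated vector database adds on top is, among other things, the indexing that avoids the full scan in the query above.&lt;/p&gt;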

&lt;h2&gt;
  
  
  What is a vector?
&lt;/h2&gt;

&lt;p&gt;Let’s take a trip down memory lane. Hopefully the following definition brings back some good memories from your youth.&lt;/p&gt;

&lt;p&gt;💡 a vector is a mathematical object that represents a quantity that has both magnitude and direction.&lt;/p&gt;

&lt;p&gt;This is the definition of a vector that most people have encountered at some point in their life.&lt;/p&gt;

&lt;p&gt;If we go a little bit deeper into Wikipedia, we will also find the following general definition of vectors.&lt;/p&gt;

&lt;p&gt;💡 In mathematics and physics, a vector is a term that refers colloquially to some quantities that cannot be expressed by a single number (a scalar), or to elements of some vector spaces.&lt;/p&gt;

&lt;p&gt;To the above definition let’s add the one that refers to what a vector is in computer science.&lt;/p&gt;

&lt;p&gt;💡 In computer science, an array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.&lt;/p&gt;

&lt;p&gt;The above definition refers to an “array”, but array and vector are often used interchangeably. &lt;/p&gt;

&lt;p&gt;We will talk more about vector spaces and features and all the cool stuff of AI a bit later but for now the above definitions are what matter.&lt;/p&gt;

&lt;p&gt;First, you have to forget what you learned about vectors at school; we are not talking about Euclidean vectors here. Magnitude and direction are not important. &lt;/p&gt;

&lt;p&gt;What is important is the way we plan to represent the world in our database.&lt;/p&gt;

&lt;p&gt;💡 We use quantities that cannot be expressed by a single number; instead, we care about elements of some kind of vector space, and the way we represent these values in a computer is as a collection of values, each identified by a key or index.&lt;/p&gt;

&lt;p&gt;The above gives us the &lt;strong&gt;how&lt;/strong&gt; we want to &lt;strong&gt;represent&lt;/strong&gt; the world and the &lt;strong&gt;how&lt;/strong&gt; to &lt;strong&gt;store&lt;/strong&gt; this information in a way that a &lt;strong&gt;machine&lt;/strong&gt; can process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need vectors?
&lt;/h2&gt;

&lt;p&gt;tldr - vectors allow machines to understand how things like text, photos, and video are related to one another.&lt;/p&gt;

&lt;p&gt;So far we’ve been a bit too technical, offering definitions that hopefully make things clearer, but we haven’t talked at all about why we even care about vectors. What is wrong with what traditional relational databases already offer?&lt;/p&gt;

&lt;p&gt;It all started with our need to represent rich text documents not just syntactically but also semantically. &lt;/p&gt;

&lt;p&gt;The idea is that we can represent a document as a vector of identifiers. These vectors define a vector space, which happens to also be an algebraic model.&lt;/p&gt;

&lt;p&gt;Because of that, we hope that we can use the mathematical tools of algebraic vector spaces to do interesting things like figuring out how similar two documents are!&lt;/p&gt;
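&lt;p&gt;Here’s a minimal sketch of that idea (a toy bag-of-words example of my own, not Lucene’s actual model): each document becomes a vector of term counts, and cosine similarity tells us how close two documents are.&lt;/p&gt;

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a document as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "stock prices fell today",
]
# Build one shared vocabulary so all document vectors have the same dimensions.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in docs]
# The two cat documents end up closer to each other than to the finance one.
```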

&lt;p&gt;This idea is not new. Do you know this guy?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg75um5tlzwezbzur1qgj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg75um5tlzwezbzur1qgj.jpg" alt="By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0," width="360" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0,&lt;/p&gt;

&lt;p&gt;In case you don’t, this is Doug Cutting and he’s the author of &lt;a href="https://lucene.apache.org" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt; that was open sourced in 1999. Lucene is probably the first and most well known library for indexing and searching text. Lucene implements the “vector space model” we talked about.&lt;/p&gt;

&lt;p&gt;Vectors and vector spaces are a powerful way to represent information in a way that we can perform search and comparisons beyond what the standard scalar operators allow us to do. &lt;/p&gt;

&lt;p&gt;But hopefully you are already wondering: if Lucene and the vector space model have existed since the ’90s, why do we care about vector databases today? Also, is Lucene a vector database?&lt;/p&gt;

&lt;p&gt;To understand why, we need to talk about a few more things first. But before we do that, let’s summarize.&lt;/p&gt;

&lt;p&gt;💡 Vectors are useful because we can turn rich information into vectors in an algebraic model in which we can apply standard algebraic operations like comparisons and measurements. These operations can then be used for information retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embeddings
&lt;/h2&gt;

&lt;p&gt;Since 1999 and Lucene, it took us about another 13 years to take the next step in information retrieval. &lt;/p&gt;

&lt;p&gt;Welcome to 2013 and the work of Tomas Mikolov at Google, called Word2Vec. &lt;/p&gt;

&lt;p&gt;Word2Vec is a technique that uses neural networks to learn word associations from a large corpus of text. These neural networks generate what the NLP literature usually calls &lt;em&gt;word embeddings&lt;/em&gt;, which are representations of words. &lt;/p&gt;

&lt;p&gt;The representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.&lt;/p&gt;

&lt;p&gt;I hope you see how often the term “vector” is being used.&lt;/p&gt;

&lt;p&gt;The beauty of these algorithms is that after we have created these embeddings or vectors, we can use mathematical functions like the cosine similarity, to measure the &lt;em&gt;semantic&lt;/em&gt; similarity of words. &lt;/p&gt;
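&lt;p&gt;As a toy sketch (the vectors below are made-up values of mine, not real Word2Vec output), finding the word closest in meaning reduces to a nearest-neighbor search under cosine similarity:&lt;/p&gt;

```python
import math

# Toy 4-dimensional "embeddings"; real Word2Vec vectors have 100-300 dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.7, 0.2, 0.1],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word):
    """Return the semantically closest word under cosine similarity."""
    return max(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(embeddings[word], embeddings[other]),
    )
```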

&lt;p&gt;It’s also important to note that these embeddings, or representations, are real-valued vectors. &lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Transformers
&lt;/h3&gt;

&lt;p&gt;Today, Word2Vec is no longer the state of the art in generating embeddings. Instead, we use &lt;em&gt;Transformers&lt;/em&gt;, which are deep learning models. Models like GPT are based on Transformers.&lt;/p&gt;

&lt;p&gt;Regardless of the model used though, the output remains the same. Our information is represented as a real-valued vector and we can still use math to retrieve semantic information from our data. &lt;/p&gt;

&lt;p&gt;💡 Embeddings are representations of words that turn them into real-valued vectors, which can then be used in conjunction with standard algebraic tools to extract semantic information, e.g. to compare two words semantically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Put Everything Together
&lt;/h2&gt;

&lt;p&gt;In 2023 we have some amazing technologies that can take rich information as input, e.g. a novel, and turn it into a new representation that we can query using machines. &lt;/p&gt;

&lt;p&gt;To do that, these technologies turn the information into real-valued vectors.&lt;/p&gt;

&lt;p&gt;To work with this information we now need efficient systems to store and process these real-valued vectors and do that at scale. &lt;/p&gt;

&lt;p&gt;That’s exactly what a vector database is. &lt;/p&gt;

&lt;p&gt;💡 A Vector Database is a DBMS that can efficiently store real-valued vectors of arbitrary dimensions and perform operations on them, like applying the cosine-similarity function. On top of that, a Vector Database also has to offer all the functionality commonly found in a DBMS, like durability, integrity, and user-driven manipulation of the data.&lt;/p&gt;
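&lt;p&gt;As a minimal sketch of that core operation, here is a brute-force nearest-neighbor search by cosine similarity in Python. Everything a real vector database adds (approximate indexes, durability, scale) is deliberately left out, and all names and vectors are invented for illustration:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two real-valued vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """Brute-force nearest-neighbor search; real vector databases use
    approximate indexes (e.g. HNSW) to make this fast at scale."""

    def __init__(self):
        self.items = {}  # id -> vector

    def add(self, item_id, vector):
        self.items[item_id] = vector

    def search(self, query, k=1):
        # Rank all stored vectors by similarity to the query, highest first.
        ranked = sorted(self.items, key=lambda i: cosine(query, self.items[i]), reverse=True)
        return ranked[:k]

store = ToyVectorStore()
store.add("doc_cats", [0.9, 0.1, 0.0])
store.add("doc_dogs", [0.8, 0.2, 0.1])
store.add("doc_stocks", [0.0, 0.1, 0.9])
print(store.search([0.85, 0.15, 0.05], k=2))  # the two pet documents rank first
```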

&lt;p&gt;Let’s now see what the unique characteristics of a Vector Database are and how one is built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check The Next Article in the Series for how to build one!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>openai</category>
      <category>chatgpt</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MLOps is 98% Data Engineering</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 03 Apr 2023 18:19:05 +0000</pubDate>
      <link>https://forem.com/cpard/mlops-is-98-data-engineering-2bpi</link>
      <guid>https://forem.com/cpard/mlops-is-98-data-engineering-2bpi</guid>
      <description>&lt;h2&gt;
  
  
  MLOps is Mostly Data Engineering
&lt;/h2&gt;

&lt;p&gt;💡 TL;DR MLOps emerged as a new category of tools for managing data infrastructure, specifically for ML use cases with the main assumption being that ML has unique needs. &lt;/p&gt;

&lt;p&gt;After a few years and with the hype gone, it has become apparent that MLOps overlap more with Data Engineering than most people believed. Let’s see why and what that means for the MLOps ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;MLOps is a relatively recent term. A quick search on &lt;a href="https://trends.google.com/trends/explore?date=2019-01-01%202023-03-11&amp;amp;q=%2Fg%2F11h1vbjpbg&amp;amp;hl=en" rel="noopener noreferrer"&gt;Google Trends&lt;/a&gt; reveals that the term started being searched for, around the end of 2019.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tm20wdz591o08xtvgru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tm20wdz591o08xtvgru.png" alt="im1" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon examining the trend line above, we can observe a significant spike that occurred at the end of 2021. Since then, the interest has remained high.&lt;/p&gt;

&lt;p&gt;ML is not something new though. If we check &lt;a href="https://trends.google.com/trends/explore?date=all&amp;amp;q=Machine%20Learning&amp;amp;hl=en" rel="noopener noreferrer"&gt;Google Trends for that term&lt;/a&gt;, we will see that it has existed since 2004, with interest growing exponentially since 2015.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixu7g35phqsopp17hyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixu7g35phqsopp17hyi.png" alt="im2" width="800" height="302"&gt;&lt;/a&gt;&lt;br&gt;
Interest Over Time for the term Machine Learning on Google&lt;/p&gt;

&lt;p&gt;Machine learning has made amazing progress in the past 10 years, with some of the most important achievements in tech being related to it.&lt;/p&gt;

&lt;p&gt;The rapid growth of machine learning is what sparked the creation of MLOps as a category. With the pace of innovation around ML accelerating, teams and companies have started to have issues keeping up. &lt;/p&gt;

&lt;p&gt;Building and operating ML products started putting a lot of pressure on the data and ML engineering teams, and where there’s pain there’s also opportunity!&lt;/p&gt;

&lt;p&gt;More and more people started seeing opportunities to bring new products to the market, promising to turn every company out there with any data into an AI-driven organization. &lt;/p&gt;

&lt;p&gt;And just like that, we reached the state of the industry you can see below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxlnewixxckwlf4pi8pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxlnewixxckwlf4pi8pf.png" alt="im3" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MLOps category as included in &lt;a href="https://mattturck.com/landscape/mad2023.pdf" rel="noopener noreferrer"&gt;MAD 2023&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep in mind that the above landscape includes only companies labeled as “MLOps” and there are overlaps with other categories in the ML section of &lt;strong&gt;&lt;em&gt;MAD 2023&lt;/em&gt;&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;43 vendors and around $1B in investments, without accounting for public companies like Google and AWS investing in the space. &lt;/p&gt;

&lt;p&gt;What are all these companies offering? Let’s see!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is inside an MLOps platform?
&lt;/h2&gt;

&lt;p&gt;The MLOps vendors can be split among a number of product categories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment &amp;amp; Serving of models, e.g. OctoML&lt;/li&gt;
&lt;li&gt;Model Quality and Monitoring, e.g. Weights &amp;amp; Biases&lt;/li&gt;
&lt;li&gt;Model training, e.g. AWS Sagemaker&lt;/li&gt;
&lt;li&gt;Feature Stores, e.g. Tecton&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s important to mention here that the above categories are complementary in many cases; for example, if you use a Feature Store, you also need a service for model training. &lt;/p&gt;

&lt;p&gt;If you pay attention to the product categories above, you will notice that there is nothing particularly unique about them in the grand scheme of things.&lt;/p&gt;

&lt;p&gt;What do I mean by that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment and serving of models&lt;/strong&gt; → This is a common operation found in both data engineering and software engineering. People have been deploying pipelines or even better, deploying applications of various complexity way before ML was a thing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model quality and Monitoring&lt;/strong&gt; → This is a problem unique to ML. The way you monitor a model for quality is not the same as the way you monitor a software project or a data pipeline. But this is only part of the quality problem, as we will see later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model training&lt;/strong&gt; → This is unique to ML, but building models is nothing new. The question is: what has changed in the past five years that requires a completely different paradigm for doing it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature stores&lt;/strong&gt; → This is one of the most interesting products of MLOps. For the uninitiated, the first thing that comes to mind is some kind of specialized database, but feature stores are actually more than that. They are a complete data infrastructure architecture that vendors have proposed and attempted to productize. How different is it from classic data infrastructure architectures? We will see.  &lt;/p&gt;

&lt;p&gt;Let’s see how each one of the above categories overlap (or not) with Data Engineering and what that means.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment &amp;amp; Serving of Models
&lt;/h2&gt;

&lt;p&gt;This is one of the most interesting aspects of MLOps in my opinion. Mainly because this is the part where the outcome of the work an ML Engineer does gets to the point where concrete value can be generated out of it.&lt;/p&gt;

&lt;p&gt;A recommender can serve recommendations to users and fraud detection can be applied in real time.&lt;/p&gt;

&lt;p&gt;But what is interesting here is that this process doesn’t have much to do with ML; the engineering problems are more related to product engineering. &lt;/p&gt;

&lt;p&gt;We can think of a model as a function that requires some input and generates some output. To deliver value with this function we need a way to add it as part of the product experience we are delivering.&lt;/p&gt;

&lt;p&gt;In engineering terms that means that we have to wrap the model as a service with a clean API that will be exposed to the product engineers.&lt;/p&gt;
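&lt;p&gt;As a sketch of what “wrapping the model as a service” means, here is a minimal Python example. The model, its features, and the JSON contract are all invented for illustration; a real deployment would use a proper web framework and model runtime:&lt;/p&gt;

```python
import json

def predict(features):
    # Stand-in for a trained model: just a function from input to output.
    score = 0.3 * features["age"] + 0.7 * features["income"]
    return {"churn_risk": "high" if score > 50 else "low"}

def handle_request(body):
    """The thin service layer product engineers integrate against:
    a clean JSON-in / JSON-out contract around the model."""
    features = json.loads(body)
    return json.dumps(predict(features))

print(handle_request('{"age": 40, "income": 80}'))  # {"churn_risk": "high"}
```

&lt;p&gt;From the outside, this is indistinguishable from any other service: it takes a request, returns a response, and everything around it (deployment, scaling, monitoring the endpoint) is standard platform engineering.&lt;/p&gt;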

&lt;p&gt;Then we need to deploy this service in a scalable and predictable way, just like we do with any other service for our product.&lt;/p&gt;

&lt;p&gt;After that we need to operate the service and ensure that it is provisioned with the resources it needs based on demand.&lt;/p&gt;

&lt;p&gt;We also need to monitor the service for problems and be able to fix them as soon as possible.&lt;/p&gt;

&lt;p&gt;Finally, we want some kind of continuous integration and deployment process to ship updates to the service, just like we do with any other service in our product.&lt;/p&gt;

&lt;p&gt;As we can see, the above process is almost identical to managing the release cycle of any other software component out there, with product engineering as the primary stakeholder. &lt;/p&gt;

&lt;p&gt;After all, they have to ensure that the new functionality the model provides is integrated into the product in the right way, without disrupting its operations. &lt;/p&gt;

&lt;p&gt;There’s one specific need that is imposed on the engineering and ops teams by working with ML models, and it is related to monitoring the performance of the model itself, but we will talk more about this later.&lt;/p&gt;

&lt;p&gt;The question here is: if integrating a model into our product doesn’t differ from any other feature we release, in terms of release and platform engineering and operations, why do we need a whole new category of products?&lt;/p&gt;

&lt;p&gt;My opinion here is that the industry is trying to solve the unique challenges of turning models into services by building complete new platforms, but this is less than optimal. &lt;/p&gt;

&lt;p&gt;The true need here is developer tooling that will enrich the existing and proven platforms and methodologies for releasing and operating software at scale for the case of doing that with ML models as the foundational software artifact. &lt;/p&gt;

&lt;p&gt;We don’t need MLOps engineers, we need tools that will allow ML Engineers to package their work in a way that the platform and release engineers will be able to consume and produce the artifacts needed for the product engineers to integrate into the product. &lt;/p&gt;

&lt;p&gt;A recurrent pattern I see is an attempt from vendors who are trying to become category creators to define a new type of engineer. &lt;/p&gt;

&lt;p&gt;In most cases, this is a crossover between existing roles, e.g. the analytics engineer, someone who is primarily an analyst but also does some of the data engineering work, such as creating pipelines. &lt;/p&gt;

&lt;p&gt;This is probably a smart marketing move but the world doesn’t work like that. New roles emerge and cannot be forced by a vendor.&lt;/p&gt;

&lt;p&gt;Why would we want ML Engineers to assume the responsibilities of a release or platform engineer? Why would we want to introduce them to a completely new category of tools that feels alien to their practice?&lt;/p&gt;

&lt;p&gt;Separation of concerns is a good thing both in software architecture and in organizational design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Quality and Monitoring
&lt;/h2&gt;

&lt;p&gt;This is where things get really interesting. Quality assurance, control, and monitoring are huge topics in software engineering. In a way, and with a bit of exaggeration, these are the elements that turn software engineering into… engineering. &lt;/p&gt;

&lt;p&gt;There are many best practices and mature platforms for software quality related tasks. The problem is that ML models can easily challenge these.&lt;/p&gt;

&lt;p&gt;You might have heard that quality in data infrastructure is hard and it is. It’s not just the software that we have to monitor for quality, it’s also the data. And data is a different beast when it comes to applying quality concepts.&lt;/p&gt;

&lt;p&gt;In ML the situation is even worse. You pretty much have a generated black-box system, and you need to monitor its performance just by observing the outputs it produces for the inputs it gets in production. &lt;/p&gt;

&lt;p&gt;Because of this, model quality and monitoring is usually mentioned together with terms like model drift, where the model’s “predictive” performance is monitored over time; if it drops below a threshold, we know that we need to retrain the model with fresh data.&lt;/p&gt;

&lt;p&gt;Which makes sense, right? As our product changes and our customers’ behavior changes, the model needs to be retrained to account for these changes.&lt;/p&gt;
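&lt;p&gt;A minimal sketch of that retraining trigger in Python; the accuracy numbers, window size, and threshold are all invented for illustration:&lt;/p&gt;

```python
def needs_retraining(accuracy_history, threshold=0.85, window=3):
    """Flag the model for retraining when the rolling mean of its
    observed accuracy in production drops below the threshold."""
    if len(accuracy_history) < window:
        return False
    recent = accuracy_history[-window:]
    return sum(recent) / window < threshold

# A model slowly drifting as customer behavior changes:
history = [0.93, 0.92, 0.90, 0.86, 0.82, 0.79]
print(needs_retraining(history))  # True: rolling mean ≈ 0.823 < 0.85
```

&lt;p&gt;Structurally this is the same threshold-and-alert pattern used for any product metric, which is exactly the overlap argued below.&lt;/p&gt;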

&lt;p&gt;I have two main arguments here.&lt;/p&gt;

&lt;p&gt;The first is: how different is the observability of model quality metrics like drift from any product-related monitoring? In product, we keep monitoring the performance of our features: do people engage with them in the way we expect? If something changed and engagement dropped, we should address that, right? &lt;/p&gt;

&lt;p&gt;These are all part of what is usually referred to as experimentation infrastructure for product, and a big part of it requires the right data infrastructure and data engineering to exist. &lt;/p&gt;

&lt;p&gt;No matter how unique ML models are, in the end we are observing how a service or feature performs while interacting with our users and, based on the data we collect, figuring out if action is needed. &lt;/p&gt;

&lt;p&gt;My feeling is that there’s a lot of overlap here between ML observability and the data infrastructure and engineering foundations that the organization is building for product experimentation.&lt;/p&gt;

&lt;p&gt;My other argument is about data quality in general. ML models are built on top of data; their quality is a direct reflection of the quality of the data used to build them.&lt;/p&gt;

&lt;p&gt;This is a serious problem that data engineering is constantly fighting, and I can’t see how replicating this process helps in any way to solve it. &lt;/p&gt;

&lt;p&gt;Data engineers are the people who are monitoring the data from its capture to the point where the ML engineer can use it. They have access to the whole supply chain of data and they can monitor and add controls at any point of that chain. &lt;/p&gt;

&lt;p&gt;Adding another platform that is overlapping with both the data engineering and product engineering quality controls is not going to solve the problem and in the worst case it might make it even worse.&lt;/p&gt;

&lt;p&gt;Again, the solution here is engineering tooling that enriches the existing architectures and solutions: figuring out what quality for data entails and equipping the people whose job is to ensure data and product quality to extend their reach into the ML models too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Training
&lt;/h2&gt;

&lt;p&gt;This is a short one to be honest. Model Training has more to do with Cloud Computing than anything else and in my opinion this is the space where the big cloud providers are mainly delivering value today. The main reason being the need for hardware to exist to do the actual training. &lt;/p&gt;

&lt;p&gt;But in the general case, model training is nothing more than a data pipeline. Data is read from a number of sources and gets transformed through the application of a training algorithm. It doesn’t matter that much whether this happens on the CPU or the GPU. &lt;/p&gt;

&lt;p&gt;This is the bread and butter of Data Engineering, the tooling exists already and the main differentiation that I see here is the cloud compute abstraction where we are talking about a completely different category of infrastructure anyway. &lt;/p&gt;

&lt;p&gt;Model training at scale should be part of the data engineering discipline as they have the tooling already, they have the responsibility for the SLAs on the data needed and they can control that release lifecycle much better.&lt;/p&gt;

&lt;p&gt;Should the ML people bother with these operations? I can’t see why, to be honest. I believe they would prefer to spend more time building new models than dealing with operations for data crunching at scale. &lt;/p&gt;

&lt;p&gt;I’m starting to repeat myself at this point, but again, we don’t need new platforms. We just need to give DEs the right tooling to communicate effectively with both ML and production engineers and add model training as another step in their ETL pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Stores
&lt;/h2&gt;

&lt;p&gt;I left Feature Stores for the end on purpose as they are a great example of the overlap with data engineering while their popularity is a great indication that something is not right with the current state of data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tecton.ai%2Fwp-content%2Fuploads%2F2020%2F10%2Fwhatisfeaturestore2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tecton.ai%2Fwp-content%2Fuploads%2F2020%2F10%2Fwhatisfeaturestore2.svg" alt="feature" width="1000" height="500"&gt;&lt;/a&gt;&lt;br&gt;
The above is a feature store architecture as presented by Tecton, one of the first and most popular feature store vendors.&lt;/p&gt;

&lt;p&gt;Looking at that we see that we have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stream data sources&lt;/li&gt;
&lt;li&gt;Batch data sources&lt;/li&gt;
&lt;li&gt;Transformations&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;li&gt;Serving&lt;/li&gt;
&lt;li&gt;Model serving and training&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature stores are similar to a typical data infrastructure architecture used by companies that require both streaming and batch processing capabilities. However, they specialize in supporting machine learning features by serving only one type of data consumer - the ML model.&lt;/p&gt;

&lt;p&gt;Vendors have packaged the feature store architecture into products, which has caused some confusion. Some may question the need for another Spark or Flink cluster for real-time feature generation, especially if they are already using those tools for ETL jobs. However, feature stores are useful because they describe what needs to be added to existing data infrastructure to effectively productize machine learning. &lt;/p&gt;

&lt;p&gt;As a product, feature stores should focus on building tooling and practices for data, ML, and product engineers to work together more effectively. Any additional overhead and complexity should be carefully evaluated to ensure that the benefits of using a feature store outweigh the costs.&lt;/p&gt;

&lt;p&gt;Vendors should focus on providing useful tooling to support this, rather than duplicating existing data infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I hope that by reading this essay you didn’t feel like I’m trying to dismiss MLOps because I’m not. &lt;/p&gt;

&lt;p&gt;I believe that ML and its productization is important and will become even more important in the future and for this to happen the right tooling is needed. &lt;/p&gt;

&lt;p&gt;But it’s time for the MLOps industry to mature and understand who the right audience is, what the problems are and bring the next iteration of solutions in the market. &lt;/p&gt;

&lt;p&gt;Money and time were spent, and lessons should have been learned. I can’t wait to see what the next iteration of these products will be.&lt;/p&gt;

&lt;p&gt;There’s a lot of opportunity ahead!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>database</category>
      <category>career</category>
    </item>
    <item>
      <title>What is a Data Contract?</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 30 Mar 2023 19:30:26 +0000</pubDate>
      <link>https://forem.com/cpard/what-is-a-data-contract-5dnj</link>
      <guid>https://forem.com/cpard/what-is-a-data-contract-5dnj</guid>
      <description>&lt;p&gt;You might have started hearing the term Data Contract recently and wonder what it is. &lt;/p&gt;

&lt;p&gt;TL;DR: Data contracts are just integration testing, CI/CD, and APIs for data.&lt;/p&gt;

&lt;p&gt;But, the difficult part of data contracts comes from the organizational complexity of implementing them.&lt;/p&gt;

&lt;p&gt;Technically, data contracts require investing in data lineage and data profiling. Nothing much new here.&lt;/p&gt;

&lt;p&gt;Data quality is a team sport though, you really need everyone in the org to buy into the concept of data contracts to implement them effectively. &lt;/p&gt;

&lt;p&gt;Data quality is also hard, so you need to clearly articulate the value it brings for people to keep caring about it!&lt;/p&gt;

&lt;p&gt;So, for data contracts to succeed, together with the tech we also need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create the right environment for people to collaborate cross-functionally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communicate effectively across the whole organization, starting from leadership, about the value of data quality or the cost of bad quality in data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on incremental data quality improvements instead of trying to build a complete end to end solution. Data quality is a continuous process anyway and you want to start delivering value as soon as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learn from other disciplines. Alert fatigue is a thing and SREs and DevOps have known about it for a long time. No need to reinvent and learn by doing the same mistakes other engineers have already done.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, remember that Data Contracts already exist in any organization. They are just implicit. The whole point of talking about Data Contracts is to encourage people to make them explicit and to understand the value you get by doing that.&lt;/p&gt;
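&lt;p&gt;To make “explicit” concrete: in its simplest form, an explicit data contract is just a schema the producer promises and the consumer verifies. A minimal sketch in Python, with field names and types invented for illustration:&lt;/p&gt;

```python
# An explicit contract: the schema a producer promises to emit.
CONTRACT = {"user_id": int, "event": str, "ts": float}

def violations(record, contract=CONTRACT):
    """Return a list of contract violations; empty means the record conforms."""
    found = []
    for field, expected in contract.items():
        if field not in record:
            found.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            found.append(f"{field}: expected {expected.__name__}")
    return found

print(violations({"user_id": 1, "event": "click", "ts": 1680000000.0}))  # []
print(violations({"user_id": "1", "event": "click"}))  # wrong type + missing ts
```

&lt;p&gt;Run as a check in CI/CD or against production samples, this is the “integration testing for data” the TL;DR describes.&lt;/p&gt;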

&lt;p&gt;For more on this topic, Chad Sanderson, who coined the term Data Contract, did an amazing job explaining everything on &lt;a href="https://datastackshow.com/podcast/data-quality-and-data-contracts-with-chad-sanderson-of-data-quality-camp/" rel="noopener noreferrer"&gt;this Data Stack Show episode&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>sql</category>
      <category>python</category>
    </item>
    <item>
      <title>A Tutorial on SQL Window Functions.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Sat, 18 Mar 2023 21:27:26 +0000</pubDate>
      <link>https://forem.com/cpard/a-tutorial-on-sql-window-functions-1ck1</link>
      <guid>https://forem.com/cpard/a-tutorial-on-sql-window-functions-1ck1</guid>
      <description>&lt;h2&gt;
  
  
  Working with SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;💡 TL;DR: Window functions are among the most powerful and useful features of any SQL query engine. However, the declarative nature of SQL can make them feel counterintuitive when you first start working with them. In this guide, I will demonstrate the beauty of SQL windows and show that they are actually much less intimidating than you might think (and even fun!).&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;DuckDB provides 14 SQL window-related functions in addition to all the aggregation functions that can be combined with windows. Snowflake, on the other hand, offers more than 70 functions that can be used with SQL windows. PostgreSQL also supports 11 SQL window-related functions, as well as all the aggregation functions that are packaged by default, in addition to any user-provided aggregation function.&lt;/p&gt;

&lt;p&gt;Hopefully, the above information has captured your attention and helped you realize how important SQL windows are, based on the effort database vendors are making to add support for them.&lt;/p&gt;

&lt;p&gt;But what’s a window in SQL?&lt;/p&gt;

&lt;p&gt;The concept of windows is actually pretty simple. It allows us to perform calculations across sets of rows that are related to the current row in some way. &lt;/p&gt;

&lt;p&gt;Think of iterating through all the rows, where the calculation we want to perform involves not just the current row’s values but also a subset of the total rows.&lt;/p&gt;

&lt;p&gt;Another way to think about window functions is by considering the &lt;em&gt;GROUP BY&lt;/em&gt; semantics. When we use &lt;em&gt;GROUP BY&lt;/em&gt; we are asking SQL to compute a function by grouping first the data using the parameters of the &lt;em&gt;GROUP BY&lt;/em&gt; clause. Consider the following SQL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_actions&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;user_activity&lt;/span&gt; &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, we ask SQL to split events among unique user_ids and count them for each user separately. Both the calculated values and the grouping keys will be included in the resulting table. For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Assuming the above input table, the result of the query will look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;total_actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can think of SQL windows as processing data in a similar way, but without necessarily collapsing the results into one row per group at the end.&lt;/p&gt;
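&lt;p&gt;To see the difference concretely, here is a sketch using Python’s built-in sqlite3 module (window functions require SQLite 3.25 or newer); the table mirrors the example above:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (user_id INTEGER, event TEXT)")
con.executemany("INSERT INTO user_activity VALUES (?, ?)",
                [(1, "click"), (2, "click"), (2, "load"), (1, "load"), (1, "load")])

# GROUP BY collapses the rows: one output row per user.
grouped = con.execute(
    "SELECT user_id, count(event) AS total_actions "
    "FROM user_activity GROUP BY user_id ORDER BY user_id").fetchall()
print(grouped)  # [(1, 3), (2, 2)]

# The window version keeps every input row, annotated with its group's count.
windowed = con.execute(
    "SELECT user_id, event, count(event) OVER (PARTITION BY user_id) "
    "FROM user_activity ORDER BY user_id").fetchall()
print(windowed)  # five rows, each carrying its user's total
```

&lt;p&gt;Same grouping, same count, but the window query preserves every original row instead of collapsing them.&lt;/p&gt;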

&lt;p&gt;But windows are actually even more powerful as we will see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes after partitioning ?
&lt;/h2&gt;

&lt;p&gt;As we’ve seen previously, partitioning is the first important concept to understand about windows. We can create subsets of our data and perform calculations only inside these subsets, and the main mechanism for defining the sets is describing to SQL how to partition the data.&lt;/p&gt;

&lt;p&gt;But we can do more than that! See the example below,&lt;/p&gt;

&lt;p&gt;Here's an example SQL query that uses a window function and the &lt;code&gt;LAG()&lt;/code&gt; function to calculate the time lapse between consecutive events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;time_lapse&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;user_activity&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query partitions the data by &lt;code&gt;user_id&lt;/code&gt; and orders it by &lt;code&gt;event_time&lt;/code&gt;. Remember what we said previously about partitioning? You see it here in action. We want our calculation to be performed for each user, so we partition on &lt;code&gt;user_id&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;We also sort our data by &lt;code&gt;event_time&lt;/code&gt;, because we want to calculate the time it took our user to perform one event after the other; the sorting matters for what we’ll do next.&lt;/p&gt;

&lt;p&gt;Our query then uses the &lt;code&gt;LAG()&lt;/code&gt; function, which is the window function that will do the magic for us. What &lt;code&gt;LAG()&lt;/code&gt; does is make the previous row’s value available to the row we are currently processing, within our window!&lt;/p&gt;

&lt;p&gt;The code part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does exactly that: while we process the current row’s &lt;code&gt;event_time&lt;/code&gt;, we get access to the &lt;code&gt;event_time&lt;/code&gt; value of the previous row, and because the rows are sorted we can subtract the two values and calculate the time lapse!&lt;/p&gt;

&lt;p&gt;The window magic is defined in this part of the syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OVER&lt;/code&gt; keyword indicates the start of the window definition. Just try to read it as a sentence written in English and it almost explains what is happening. That’s part of the beauty of a declarative language!&lt;/p&gt;

&lt;p&gt;The difference is calculated over sets of rows that are created using the user_id, so we get one set of rows for each user, and each set is sorted by the event_time. What is important to remember here is that the sorting is not global: data is sorted only within each individual set defined by the user_id. &lt;/p&gt;

&lt;p&gt;After the partitions have been created and sorted, the query engine starts iterating over the rows of each partition, and at each row the &lt;em&gt;LAG&lt;/em&gt; function makes the value of the previous row available to the engine. &lt;/p&gt;

&lt;p&gt;At this point, everything is available for the engine to calculate the difference between the two values and that’s exactly what it does!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LAG()&lt;/code&gt; and the above example is a great introduction to the last important concept about windows in SQL. &lt;em&gt;Framing&lt;/em&gt;!&lt;/p&gt;
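&lt;p&gt;If you want to run this pattern end to end right now, here’s a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite has supported window functions since 3.25). The table name and values are made up for illustration; storing &lt;code&gt;event_time&lt;/code&gt; as epoch seconds keeps the subtraction simple:&lt;/p&gt;

```python
import sqlite3

# Toy user_activity table; event_time is stored as epoch seconds so the
# subtraction directly yields the gap in seconds.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (user_id TEXT, event_time INTEGER)")
con.executemany(
    "INSERT INTO user_activity VALUES (?, ?)",
    [("a", 0), ("a", 60), ("a", 200), ("b", 10), ("b", 25)],
)

rows = con.execute("""
    SELECT user_id, event_time,
           event_time - LAG(event_time) OVER (
               PARTITION BY user_id ORDER BY event_time
           ) AS time_lapse
    FROM user_activity
    ORDER BY user_id, event_time
""").fetchall()
print(rows)
# [('a', 0, None), ('a', 60, 60), ('a', 200, 140), ('b', 10, None), ('b', 25, 15)]
```

&lt;p&gt;Note how the first row of each partition gets &lt;code&gt;None&lt;/code&gt;: there is no previous row for &lt;em&gt;LAG&lt;/em&gt; to look at.&lt;/p&gt;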

&lt;h2&gt;
  
  
  Windows and Frames
&lt;/h2&gt;

&lt;p&gt;Windowing breaks up a relation into independent groups or "partitions," then orders those partitions and computes a new column for each row based on nearby values.&lt;/p&gt;

&lt;p&gt;In many cases, the functions we apply depend only on the partition boundary and perhaps the ordering, as in the very simple first example we went through.&lt;/p&gt;

&lt;p&gt;In other cases though, the function might need access to some of the previous or following values. This was the case with &lt;em&gt;LAG&lt;/em&gt; in our previous example. &lt;/p&gt;

&lt;p&gt;Although we had defined a partition based on the user_id, we also needed to provide our function (in this case subtraction) with the value preceding the current one. This is exactly what &lt;em&gt;LAG&lt;/em&gt; did.&lt;/p&gt;

&lt;p&gt;Frames are a generalization of this concept. &lt;/p&gt;

&lt;p&gt;💡 A frame is a number of rows on either side (preceding or following) of the current row.&lt;/p&gt;

&lt;p&gt;In our previous example, the frame was the one row preceding the current row. We didn’t specify how many preceding rows to consider, so &lt;em&gt;LAG&lt;/em&gt; used the default, which is 1, but we can use any number we want.&lt;/p&gt;

&lt;p&gt;In DuckDB the definition of &lt;em&gt;LAG&lt;/em&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="k"&gt;any&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;any&lt;/span&gt; &lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Offset&lt;/em&gt; refers to the number of rows preceding the current one that we want to access. We can also set a &lt;em&gt;default&lt;/em&gt; value to return when the requested offset does not exist. For example, if we are at the first row and want to access the previous one, we can define a default value to return instead.&lt;/p&gt;
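&lt;p&gt;Here’s a quick sketch of the three-argument form, again using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; (which takes the same &lt;code&gt;lag(expr, offset, default)&lt;/code&gt; arguments); the table and values are invented:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (30,), (40,)])

# LAG(v, 2, -1): look two rows back; return the default -1 when no such row exists.
rows = con.execute(
    "SELECT v, LAG(v, 2, -1) OVER (ORDER BY v) FROM t ORDER BY v"
).fetchall()
print(rows)  # [(10, -1), (20, -1), (30, 10), (40, 20)]
```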

&lt;p&gt;Let’s consider the following table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x9j6oet9cib6ai56w2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x9j6oet9cib6ai56w2e.png" alt="events" width="800" height="632"&gt;&lt;/a&gt;&lt;br&gt;
To better understand partitions versus frames, let’s see how the table looks if we apply a window like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the table below you can see how it looks after the application of &lt;em&gt;PARTITION BY&lt;/em&gt;. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xrhviobch3a1sjyjke7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xrhviobch3a1sjyjke7.png" alt="events" width="800" height="625"&gt;&lt;/a&gt;&lt;br&gt;
And this is how the table looks after we order it. If you look closely, you will see that the ordering exists only inside each partition; it is not global.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnhvarfg1qjpf8ta9z8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnhvarfg1qjpf8ta9z8l.png" alt="events" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can think of the frame as a “window” sliding over the partitioned and ordered data, with a size equal to the offset parameter; in this case the offset is 1.&lt;/p&gt;

&lt;p&gt;Consider that we are currently at the row with event_id = 16249. Based on what we’ve said so far, the frame will include this row and the previous one. What do you think the result of the &lt;em&gt;LAG&lt;/em&gt; function will be for this frame?&lt;/p&gt;

&lt;p&gt;The answer is 0. Remember that the frame has a default value equal to 0, which is returned when there’s no preceding row? Remember also that the frame has meaning only inside the boundaries of the partition?&lt;/p&gt;

&lt;p&gt;The frame in this case is at the first position of the current partition and as a result the default value will be returned.&lt;/p&gt;
&lt;h2&gt;
  
  
  What about Nulls? Do they matter?
&lt;/h2&gt;

&lt;p&gt;Null values always matter! 😀&lt;/p&gt;

&lt;p&gt;We should always be aware of the null semantics of our window functions. Always check the documentation to see how the window function we care about behaves in the presence of nulls. &lt;/p&gt;

&lt;p&gt;For example, in DuckDB some window functions can be configured to ignore nulls, although the default behavior for all of them is to respect nulls. One such function is &lt;strong&gt;&lt;em&gt;LAG&lt;/em&gt;&lt;/strong&gt;, which we used earlier.&lt;/p&gt;

&lt;p&gt;In any case, make sure you understand the semantics of your functions and of the data you are working with. What happens if an aggregate function divides by a null value? &lt;/p&gt;
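&lt;p&gt;To make null propagation concrete, here’s a small sketch (again with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; and invented data) where one of the values is null:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (pos INTEGER, v INTEGER)")
con.executemany("INSERT INTO m VALUES (?, ?)", [(1, 5), (2, None), (3, 9), (4, 12)])

# LAG here respects NULLs: wherever the current or the previous value is NULL,
# the subtraction propagates NULL instead of skipping the offending row.
rows = con.execute(
    "SELECT pos, v - LAG(v) OVER (ORDER BY pos) FROM m ORDER BY pos"
).fetchall()
print(rows)  # [(1, None), (2, None), (3, None), (4, 3)]
```

&lt;p&gt;A single null value poisoned three of the four results, which is exactly why it pays to check the null semantics up front.&lt;/p&gt;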
&lt;h2&gt;
  
  
  Enough with theory, let’s have fun!
&lt;/h2&gt;

&lt;p&gt;Ok, let’s work on some examples using window functions. For these examples we will be using DuckDB. &lt;/p&gt;

&lt;p&gt;First, download DuckDB if you haven’t already. I’d recommend just &lt;a href="https://duckdb.org/#quickinstall" rel="noopener noreferrer"&gt;downloading the CLI&lt;/a&gt;, but feel free to use any other way of working with DuckDB.&lt;/p&gt;

&lt;p&gt;We will be working with JSON files so you will also need to &lt;a href="https://duckdb.org/docs/extensions/json" rel="noopener noreferrer"&gt;install the JSON extension&lt;/a&gt; for DuckDB. To do that, you just have to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#./duckdb&lt;/span&gt;

Enter &lt;span class="s2"&gt;".help"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;usage hints.
Connected to a transient &lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="nt"&gt;-memory&lt;/span&gt; database.
Use &lt;span class="s2"&gt;".open FILENAME"&lt;/span&gt; to reopen on a persistent database.
D &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all, now you are ready to start playing around and being dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  The case of Sessionization
&lt;/h3&gt;

&lt;p&gt;We will work through a very common problem that requires window functions. We want to be pragmatic here, so we are aiming for a real-life example that you will most probably face sooner or later.&lt;/p&gt;

&lt;p&gt;What is sessionization?&lt;/p&gt;

&lt;p&gt;Assuming we have a number of user interactions captured at different moments in time, how can we group them into “meaningful” buckets? By meaningful, in our context, we mean events that happened during one online session of the user. &lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;Typically a session is defined as&lt;/em&gt;: all events that happened in less than 30 minutes from each other for a specific user.&lt;/p&gt;

&lt;p&gt;The definition can get more complicated, but this one will suffice for our needs and, to be honest, it’s one of the most commonly used. For example, Google Analytics uses it as the default session definition. &lt;/p&gt;
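&lt;p&gt;To see where this definition leads, here is the classic two-step window pattern for it, sketched with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; and made-up data: flag the rows that start a new session with &lt;em&gt;LAG&lt;/em&gt;, then number the sessions with a running &lt;em&gt;SUM&lt;/em&gt; over the flags:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")  # ts: epoch seconds
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 4000), ("u1", 4100), ("u2", 50)],
)

# Step 1: compute the gap to the previous event per user (LAG).
# Step 2: flag a new session when there is no previous event or the gap
#         exceeds 30 minutes (1800s), then number sessions with a running SUM.
rows = con.execute("""
    WITH gaps AS (
        SELECT user_id, ts,
               ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS gap
        FROM events
    )
    SELECT user_id, ts,
           SUM(CASE WHEN gap IS NULL OR gap > 1800 THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id ORDER BY ts) AS session_id
    FROM gaps
    ORDER BY user_id, ts
""").fetchall()
print(rows)
# [('u1', 0, 1), ('u1', 600, 1), ('u1', 4000, 2), ('u1', 4100, 2), ('u2', 50, 1)]
```

&lt;p&gt;The same shape of query works in DuckDB over the real event data we’ll load next.&lt;/p&gt;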

&lt;h3&gt;
  
  
  The data
&lt;/h3&gt;

&lt;p&gt;Now that we have the tools and the problem, we just need data and we are ready to go.&lt;/p&gt;

&lt;p&gt;Again, we will try to be as realistic as possible. We will be using customer event data captured in the format supported by RudderStack and Segment. &lt;/p&gt;

&lt;p&gt;These are the two most commonly used tools for capturing user interactions.&lt;/p&gt;

&lt;p&gt;For more information on the whole schema of this format, you should check the amazing &lt;a href="https://www.rudderstack.com/docs/event-spec/standard-events/" rel="noopener noreferrer"&gt;documentation provided here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will be using data that has been artificially generated, in case you’d like to generate your own data, you can use the tool I used. &lt;a href="https://github.com/cpard/events" rel="noopener noreferrer"&gt;You can find it here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I’ll also include a sample file that you can use directly! Using the event generator is useful if you want to experiment with different numbers of users and events and work on the performance of your queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  The queries
&lt;/h3&gt;

&lt;p&gt;The first thing we have to do is load our data into DuckDB and see what it looks like.&lt;/p&gt;

&lt;p&gt;Here I’ll assume that the file is named &lt;em&gt;test.json&lt;/em&gt; and that it’s in the same path that you run duckdb from. Feel free to play around with paths etc.; it helps to get a better grasp of how the CLI works and of DuckDB’s SQL syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json_objects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'test.json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if we execute the above, we’ll see something like this as output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmvhj5whfu1ns9c64yp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmvhj5whfu1ns9c64yp0.png" alt="raw-json" width="800" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although this worked, it’s not exactly useful yet. We ended up with a table that has a single column of type &lt;em&gt;json&lt;/em&gt; containing our JSON objects, with one line of the input file corresponding to one row of the output table.&lt;/p&gt;

&lt;p&gt;To make this more useful, we need to use some of the additional DuckDB magic for working with JSON. &lt;/p&gt;

&lt;p&gt;Consider the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.context.traits.id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.message_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.event_type'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'original_timestamp'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; 
    &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result you get should look something like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuebco108lr9q1whoxhkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuebco108lr9q1whoxhkc.png" alt="extract-json" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we did was to use the &lt;em&gt;json_extract&lt;/em&gt; function of DuckDB to extract only the fields we care about. In this case we want:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;user_id, so we can create sessions for each user_id&lt;/li&gt;
&lt;li&gt;event_type, this is not strictly necessary but it is helpful to have some metadata about our events&lt;/li&gt;
&lt;li&gt;original_timestamp, this is obviously needed as we need to perform calculations based on time to create the sessions &lt;/li&gt;
&lt;li&gt;event_id, we want a way to link back to the initial record.&lt;/li&gt;
&lt;/ol&gt;
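&lt;p&gt;If the &lt;em&gt;json_extract&lt;/em&gt; paths look opaque, here’s the same extraction in plain Python with the stdlib &lt;code&gt;json&lt;/code&gt; module; the event line below is made up, with field names following the RudderStack/Segment spec referenced earlier:&lt;/p&gt;

```python
import json

# A made-up event line in the shape the queries above expect.
line = (
    '{"context": {"traits": {"id": "u-1"}}, "message_id": "m-1", '
    '"event_type": "track", "original_timestamp": "2023-01-01 10:00:00 UTC"}'
)

obj = json.loads(line)
# The Python equivalent of json_extract(json, '$.context.traits.id'), etc.
user_id = obj["context"]["traits"]["id"]
event_type = obj["event_type"]
print(user_id, event_type)  # u-1 track
```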

&lt;p&gt;If you paid attention to the above results you will notice a few issues&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All the data types are of type &lt;em&gt;json&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;event_id&lt;/em&gt; is &lt;em&gt;null&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first issue is expected, and it’s part of our job to take care of it as we build our code. The second issue, though, is weird: shouldn’t the event have a unique id? Is this a coincidence?&lt;/p&gt;

&lt;p&gt;Let’s see what we can figure out, but first let’s make our lives a bit easier by executing the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; 
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.context.traits.id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.message_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.event_type'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'original_timestamp'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
        &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; 
        &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We create a view so we don’t have to run the long query above every time we want to query it. Now let’s see what we can learn about the &lt;em&gt;message_id&lt;/em&gt; column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;json&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query gives us back all the distinct values of the column, and for &lt;em&gt;event_id&lt;/em&gt; everything is null, which is not good! &lt;/p&gt;

&lt;p&gt;Obviously this shouldn’t have happened; the events should have a unique ID. But reality is far from ideal and issues like this can always happen, so how do we move forward with the dirty data set we have?&lt;/p&gt;

&lt;p&gt;If we need the event_id, we have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check our pipelines to see why the event_id wasn’t captured. Maybe the pipeline ignored the field when the data was extracted, or maybe it was never captured in the first place.&lt;/li&gt;
&lt;li&gt;Come up with a solution to add a unique id for our current data set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although we don’t need the event_id for our sessionization example, we’ll go through an example of how this could be done. Keep in mind that there are many different ways to do it actually!&lt;/p&gt;

&lt;p&gt;The options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A common way to create unique IDs is to use hashing. This will also allow you to deduplicate your data if you need to. The way to do it is to apply a hashing function, e.g. md5, to the whole row. If two rows are completely identical, the generated hashes will be equal.&lt;/li&gt;
&lt;li&gt;An even easier way is to just use the position of the row in the table as the id. This guarantees uniqueness, but it won’t help you deduplicate the data.&lt;/li&gt;
&lt;li&gt;Use something like the uuid DuckDB function that returns random UUIDs, and hope that random also means unique.&lt;/li&gt;
&lt;/ol&gt;
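&lt;p&gt;A quick sketch of option 1 in Python, using the stdlib &lt;code&gt;hashlib&lt;/code&gt;; the row tuple below reuses the sample user_id from earlier with made-up values for the other fields:&lt;/p&gt;

```python
import hashlib

# Option 1, sketched: serialize the row and hash it. Identical rows produce
# identical hashes, which is also what enables deduplication.
row = ("15474ff6-3e59-44fa-a875-13c1b2f9d101", "track", "2023-01-01 10:00:00 UTC")
row_id = hashlib.md5("|".join(row).encode("utf-8")).hexdigest()
print(row_id)  # a 32-character hex digest, stable for identical rows
```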

&lt;p&gt;In our case we will opt for the second option as it’s the easier one, but please feel free to try the first!&lt;/p&gt;

&lt;p&gt;Performing this task is also a gentle introduction to window functions, as it is where we will use our first one.&lt;/p&gt;

&lt;p&gt;See the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_idx&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faisflvg48jfl2esxsp2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faisflvg48jfl2esxsp2i.png" alt="json-table" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We excluded event_id as part of cleaning our dataset and used row_number() to get the row number and use it as the id for the event. See the use of the &lt;em&gt;over&lt;/em&gt; keyword? That’s an indication that row_number() is a window function. &lt;/p&gt;

&lt;p&gt;In our case we didn’t want to have partitions, because that wouldn’t generate globally unique ids, so we ran the window function over the whole table.&lt;/p&gt;
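&lt;p&gt;You can verify this behavior with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; as well (toy table and values below):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ev (user_id TEXT)")
con.executemany("INSERT INTO ev VALUES (?)", [("a",), ("b",), ("a",)])

# An empty OVER () runs the window over the whole table, so the generated
# numbers are globally unique rather than restarting per partition.
rows = con.execute(
    "SELECT user_id, ROW_NUMBER() OVER () AS event_idx FROM ev"
).fetchall()
print(sorted(idx for _, idx in rows))  # [1, 2, 3]
```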

&lt;p&gt;Now that we have a way to generate unique ids let’s figure out how to get rid of the &lt;em&gt;json&lt;/em&gt; type and turn it into something more useful. &lt;/p&gt;

&lt;p&gt;Consider the following SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                &lt;span class="n"&gt;user_id&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                &lt;span class="nb"&gt;varchar&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nv"&gt;"15474ff6-3e59-44fa-a875-13c1b2f9d101"&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function &lt;em&gt;CAST&lt;/em&gt; is what we need here. We ask DuckDB to take the JSON type and turn it into a string, and in this case it worked perfectly, as you can see from the result.&lt;/p&gt;

&lt;p&gt;Now consider the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;Conversion&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;""&lt;/span&gt;&lt;span class="mi"&gt;1970&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="nv"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;YYYY&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DD&lt;/span&gt; &lt;span class="n"&gt;HH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;SS&lt;/span&gt;&lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="n"&gt;US&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="err"&gt;±&lt;/span&gt;&lt;span class="n"&gt;HH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ouch, we got an error! Apparently the format we used for the date cannot be converted into a timestamp. We need to fix this before we move on. &lt;/p&gt;

&lt;p&gt;💡 These types of issues are extremely common when working with SQL and data in general. That’s why I thought it would be good to have to figure this out as part of the exercise!&lt;/p&gt;

&lt;p&gt;Check the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;parsed_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="n"&gt;parsed_timestamp&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="nb"&gt;timestamp&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;1970&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0001&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did it! So what happens in the above query?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, we trim the double-quote character from both the beginning and the end of the value.&lt;/li&gt;
&lt;li&gt;Then we account for two cases: one where the time includes milliseconds and one where it doesn’t. Again, this is an issue with how the data was generated, and we have to fix it here.&lt;/li&gt;
&lt;li&gt;For each case, we use &lt;em&gt;strptime&lt;/em&gt; to turn the string into a timestamp.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;CASE&lt;/em&gt; is the equivalent of &lt;em&gt;IF-THEN-ELSE&lt;/em&gt; statements in SQL. &lt;/p&gt;
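&lt;p&gt;If it helps to see the same fallback logic outside SQL, here is a minimal Python sketch using the standard library’s &lt;code&gt;strptime&lt;/code&gt; (the &lt;code&gt;parse_ts&lt;/code&gt; helper name is mine, not part of the dataset or DuckDB):&lt;/p&gt;

```python
from datetime import datetime

def parse_ts(raw: str) -> datetime:
    # Mirror of the SQL CASE: strip the surrounding double quotes,
    # then try the format with fractional seconds before falling
    # back to the one without them.
    s = raw.strip('"')
    for fmt in ("%Y-%m-%d %H:%M:%S.%f UTC", "%Y-%m-%d %H:%M:%S UTC"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp: " + raw)

print(parse_ts('"2023-02-28 21:07:13.123 UTC"'))  # with milliseconds
print(parse_ts('"2023-02-28 21:07:13 UTC"'))      # without
```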

&lt;p&gt;Now that we have figured everything out, let’s transform our raw data into something easier to work with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
   &lt;span class="k"&gt;CASE&lt;/span&gt; 
 &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b4f1qfq6t9a79ignf4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b4f1qfq6t9a79ignf4m.png" alt="casting" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and we have what we need! Now let’s create a view so we can work with it easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; 
   &lt;span class="k"&gt;CASE&lt;/span&gt; 
 &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;count_star&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;int64&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="mi"&gt;31415&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to check the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;describe&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv822bxwpf1zhnfg2s8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv822bxwpf1zhnfg2s8j.png" alt="describe-table" width="800" height="257"&gt;&lt;/a&gt;&lt;br&gt;
We are good to go!&lt;/p&gt;

&lt;p&gt;I know it’s been a journey so far, but we have already worked with a window function, and we also did something important: we cleaned and prepared our data!&lt;/p&gt;

&lt;p&gt;This is a big part of the work involved with data.&lt;/p&gt;

&lt;p&gt;Now let’s go back to sessions. Remember the definition of a session? &lt;/p&gt;


&lt;p&gt;💡 A session is the set of events for a particular user where less than 30 minutes pass between successive events.&lt;/p&gt;



&lt;p&gt;If you remember the examples we gave earlier, you might have already figured out that &lt;em&gt;LAG&lt;/em&gt; is a great candidate for helping us with our problem here. Let’s see how.&lt;/p&gt;

&lt;p&gt;Consider the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;events_enriched&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;prev_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; 
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'30 minutes'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 
      &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events_enriched&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are also using CTEs, which we haven’t talked about yet, but don’t worry too much if you find this &lt;em&gt;WITH&lt;/em&gt; syntax new. It’s mainly a way to organize the code and make it cleaner.&lt;/p&gt;

&lt;p&gt;As you can see we start by enriching our events by adding the previous timestamp as a new column. To do that we of course use &lt;em&gt;LAG&lt;/em&gt; and windows! The way that part of the query works should be clear to you by now.&lt;/p&gt;

&lt;p&gt;The second part is the session creation. Here we are creating a new column that tracks the session id. The interesting part is what’s inside the &lt;strong&gt;&lt;em&gt;SUM&lt;/em&gt;&lt;/strong&gt; clause. &lt;/p&gt;

&lt;p&gt;Again here you see the beauty of the declarative nature of SQL. We can add a 0 or 1 based on the difference between the two columns that represent the event times. &lt;/p&gt;

&lt;p&gt;Once again, we use &lt;em&gt;SUM&lt;/em&gt; as a window function (remember that any aggregate function can be used as a window function) to calculate the session id for each user.&lt;/p&gt;
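&lt;p&gt;To see the LAG-plus-running-SUM idea in one place, here is a small Python sketch of the same sessionization logic; the &lt;code&gt;sessionize&lt;/code&gt; function and &lt;code&gt;GAP&lt;/code&gt; constant are illustrative names of mine, and events are assumed to arrive as (user_id, timestamp) pairs:&lt;/p&gt;

```python
from datetime import datetime, timedelta
from itertools import groupby

# Same 30-minute rule as the SQL query.
GAP = timedelta(minutes=30)

def sessionize(events):
    # events: iterable of (user_id, timestamp) pairs.
    # Returns (user_id, timestamp, session_id) rows. Within each
    # user's ordered events, a missing previous event or a gap
    # larger than 30 minutes bumps the session counter -- exactly
    # what the CASE inside the windowed SUM does.
    out = []
    for user, group in groupby(sorted(events), key=lambda e: e[0]):
        prev, session = None, 0
        for _, ts in group:
            if prev is None or ts - prev > GAP:
                session += 1
            out.append((user, ts, session))
            prev = ts
    return out

t0 = datetime(2023, 2, 28, 21, 0, 0)
demo = [("u1", t0), ("u1", t0 + timedelta(minutes=10)),
        ("u1", t0 + timedelta(minutes=50))]
print(sessionize(demo))  # the 40-minute gap starts session 2
```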

&lt;p&gt;With that query we end up with a result like the following; I will limit the results to 10 rows for convenience.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ml70ewbfx3l719fxxnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ml70ewbfx3l719fxxnd.png" alt="final-data" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Isn’t it pretty? 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Window functions are super powerful. If you master the concepts behind them you’ll be able to write some very expressive and elegant SQL code. &lt;/p&gt;

&lt;p&gt;They might require you to change the way you think, especially if you are coming from more imperative programming languages, but it won’t take long to get comfortable with them.&lt;/p&gt;

&lt;p&gt;I hope that the examples I gave were helpful! &lt;/p&gt;

&lt;p&gt;In any case, please let me know what you think and what else you’d like to see as a SQL tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/data_types/interval" rel="noopener noreferrer"&gt;DuckDB Intervals Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/data_types/timestamp" rel="noopener noreferrer"&gt;DuckDB Timestamp Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/char" rel="noopener noreferrer"&gt;DuckDB Text Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/dateformat" rel="noopener noreferrer"&gt;DuckDB Date Formats Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/extensions/json" rel="noopener noreferrer"&gt;DuckDB JSON Extension Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/window_functions" rel="noopener noreferrer"&gt;DuckDB Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/installation/index" rel="noopener noreferrer"&gt;DuckDB Installation Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rudderstack.com/docs/event-spec/standard-events/common-fields/" rel="noopener noreferrer"&gt;RudderStack Event Schema Spec&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;PostgreSQL Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/sql-reference/functions-analytic#overview" rel="noopener noreferrer"&gt;Snowflake Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cpard/events" rel="noopener noreferrer"&gt;Fake Events Generator Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cpard/sql_tutorials/tree/main/window_functions" rel="noopener noreferrer"&gt;Sample Event Data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Practical Guide to SQL Dates with DuckDB.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 06 Mar 2023 00:08:09 +0000</pubDate>
      <link>https://forem.com/cpard/a-guide-to-sql-dates-with-duckdb-2kj3</link>
      <guid>https://forem.com/cpard/a-guide-to-sql-dates-with-duckdb-2kj3</guid>
      <description>&lt;h2&gt;
  
  
  A pragmatic guide on working with dates in SQL using DuckDB
&lt;/h2&gt;

&lt;p&gt;Ok, so we talked about how SQL deals with dates, and there’s a rich arsenal of tools to help us do whatever we want with time information in SQL.&lt;/p&gt;

&lt;p&gt;But, working with dates is still a pain. &lt;/p&gt;

&lt;p&gt;You might think that the technology is just not there, but I would argue that it’s not a technical issue. It’s more of a human issue, plus the fact that time is just a hard concept.&lt;/p&gt;

&lt;p&gt;Interpreting time is a messy business. &lt;/p&gt;

&lt;p&gt;There are different ways to do it, and there’s no way someone can figure out which one was used just by observing the syntax.&lt;/p&gt;

&lt;p&gt;For example, consider the following date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"2023-01-02"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's the right date format?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"yyyy-mm-dd" or "yyyy-dd-mm"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both of the above formats can syntactically express this date, but only one is right. To know which one, we either need more samples or we need someone to explicitly tell us what the format is.&lt;/p&gt;

&lt;p&gt;This is just a tiny example that people moving from Europe to the US might have encountered in their everyday lives, even without having to write code to parse dates.&lt;/p&gt;
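&lt;p&gt;To make the ambiguity concrete, here is a tiny Python illustration (its &lt;code&gt;strptime&lt;/code&gt; works much like the SQL function of the same name): the very same string parses successfully under both formats and yields two different dates.&lt;/p&gt;

```python
from datetime import datetime

s = "2023-01-02"
# Read as year-month-day: January 2nd.
print(datetime.strptime(s, "%Y-%m-%d").date())  # 2023-01-02
# Read as year-day-month: February 1st.
print(datetime.strptime(s, "%Y-%d-%m").date())  # 2023-02-01
```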

&lt;h2&gt;
  
  
  Computers are simple
&lt;/h2&gt;

&lt;p&gt;They are and they have a very elegant way of representing time. &lt;/p&gt;

&lt;p&gt;The most common way to do it is by using an integer that represents the time elapsed since a specific starting date.&lt;/p&gt;

&lt;p&gt;The most common example of this approach is the &lt;em&gt;Unix Time Stamp&lt;/em&gt;, which tracks time as a running total of seconds. The count starts at the &lt;em&gt;Unix Epoch&lt;/em&gt;: January 1st, 1970, in UTC.&lt;/p&gt;

&lt;p&gt;You might wonder how we count time before that. Well, computers can represent negative numbers, right?&lt;/p&gt;

&lt;p&gt;Assuming all timestamps are in UTC, this is a very elegant way of representing dates.&lt;/p&gt;

&lt;p&gt;Applying operations on dates becomes as easy as applying operations on integers, e.g. addition and subtraction.&lt;/p&gt;
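&lt;p&gt;Here is a quick Python sketch of the idea, using only the standard library; the example date is arbitrary:&lt;/p&gt;

```python
from datetime import datetime, timezone

# A Unix timestamp is just the number of seconds elapsed since
# the Unix Epoch, 1970-01-01 00:00:00 UTC.
ts = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
print(ts)  # 946684800

# Date arithmetic becomes plain integer arithmetic: adding one
# day is adding 86400 seconds.
next_day = datetime.fromtimestamp(ts + 86400, tz=timezone.utc)
print(next_day.date())  # 2000-01-02
```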

&lt;p&gt;Of course, it wouldn’t be fun if everything were perfect.&lt;/p&gt;

&lt;p&gt;So, we still have to ensure that timestamps are aligned on the timezone, but this is relatively easy: we just have to apply a specific offset to convert from any timezone to UTC, or to any other timezone.&lt;/p&gt;

&lt;p&gt;Humans have agreed on something called a &lt;a href="https://en.wikipedia.org/wiki/UTC_offset" rel="noopener noreferrer"&gt;UTC offset&lt;/a&gt;, where the timezone also carries its offset from UTC.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PT is the Pacific Time Zone. We have PST (Pacific Standard Time) and PDT (Pacific Daylight Time) with:

PST: UTC-08:00
PDT: UTC-07:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see from the above, we can easily convert a PST time to UTC by adding 8 hours, and a PDT time by adding 7 hours.&lt;/p&gt;
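&lt;p&gt;A short Python sketch of the same offset arithmetic, with a made-up example time:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # PST is UTC-08:00

# 13:07 in PST is 21:07 in UTC: we add back the 8-hour offset.
local = datetime(2023, 2, 28, 13, 7, 13, tzinfo=PST)
as_utc = local.astimezone(timezone.utc)
print(as_utc)  # 2023-02-28 21:07:13+00:00
```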

&lt;p&gt;As long as there’s syntactic information about the timezone in the date, it’s easy to work with: we just need an index of all the timezone offsets and the timezones they correspond to.&lt;/p&gt;

&lt;p&gt;But as you can see, even working with just the timezone information can get tricky. What if there’s no explicit timezone information in the date we got?&lt;/p&gt;

&lt;p&gt;Computers are simple, as we said, which means they don’t assume things. Humans, on the other hand, love implicit information.&lt;/p&gt;

&lt;p&gt;And again, this is where things get hard with dates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serializations are not that simple though
&lt;/h2&gt;

&lt;p&gt;Let's get a bit more practical now and consider how information is exchanged between machines on the Internet.&lt;/p&gt;

&lt;p&gt;We will consider the following formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;CSV&lt;/li&gt;
&lt;li&gt;Protocol Buffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt; is the de facto format for exchanging data between web services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt; is probably the oldest human-readable format and, despite its age and issues, it’s still around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol Buffers&lt;/strong&gt; is the de facto format for gRPC and related microservices.&lt;/p&gt;

&lt;p&gt;For the above reasons, we can assume that anyone working in tech will have to work with them sooner or later. &lt;/p&gt;

&lt;p&gt;(You can't escape!)&lt;/p&gt;

&lt;p&gt;Let's see how each format represents date and time information.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON
&lt;/h3&gt;

&lt;p&gt;How does JSON represent dates and times? Is there a datatype used?&lt;/p&gt;

&lt;p&gt;The answer is: It doesn't handle dates and there's no datatype similar to what SQL has. &lt;/p&gt;

&lt;p&gt;Dates and times are strings, or integers if someone decides to store them as timestamps, but there’s no way to tell by validating the JSON document whether a key-value pair represents a date or not. You need a human for that.&lt;/p&gt;
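&lt;p&gt;A small Python example of what this means in practice; the field names are made up, and treating &lt;code&gt;ts&lt;/code&gt; as an RFC 3339 timestamp is knowledge that lives outside the JSON itself:&lt;/p&gt;

```python
import json
from datetime import datetime

doc = json.loads('{"event": "click", "ts": "2023-02-28T21:07:13+00:00"}')

# The parser hands back a plain string; nothing in JSON marks
# the "ts" value as a date.
print(type(doc["ts"]).__name__)  # str

# Turning it into a real datetime needs a human-supplied convention.
parsed = datetime.fromisoformat(doc["ts"])
print(parsed)  # 2023-02-28 21:07:13+00:00
```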

&lt;h3&gt;
  
  
  CSV
&lt;/h3&gt;

&lt;p&gt;CSV also does not have a date or time type. Actually it doesn't have types at all. &lt;/p&gt;

&lt;p&gt;At least with JSON a numerical field will always have a number. &lt;/p&gt;

&lt;p&gt;In CSV you can have a column whose fields hold mixed values, and there’s no way to guarantee that this will not happen.&lt;/p&gt;
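&lt;p&gt;The same point in Python, using the standard library’s &lt;code&gt;csv&lt;/code&gt; module on a made-up two-line file:&lt;/p&gt;

```python
import csv
import io

data = "user_id,Event,Timestamp\n1,click,2023-02-28T21:07:13+00:00\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as a string, even user_id; any typing
# is left entirely to whoever consumes the file.
print(type(rows[0]["user_id"]).__name__)  # str
print(rows[0]["Timestamp"])
```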

&lt;h3&gt;
  
  
  Protocol Buffers
&lt;/h3&gt;

&lt;p&gt;Protobuf has support for a &lt;a href="https://protobuf.dev/reference/protobuf/google.protobuf/#timestamp" rel="noopener noreferrer"&gt;timestamp type&lt;/a&gt;: seconds and fractions of a second at nanosecond resolution, counted in UTC from the epoch (remember the Unix Epoch?).&lt;/p&gt;

&lt;p&gt;Obviously, the support Protocol Buffers has for representing time is much more limited than something like SQL, and for a good reason.&lt;/p&gt;

&lt;p&gt;Protocol Buffers cares only about machine-to-machine communication; its design is not driven by the need to render dates and times to end users in every possible way.&lt;/p&gt;

&lt;p&gt;Also, by limiting the ways that time can be represented, they can perform a lot of optimizations on how the information is serialized on the wire. &lt;/p&gt;

&lt;p&gt;Something that is super important for protocols like that.&lt;/p&gt;
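&lt;p&gt;For intuition, the shape of protobuf’s &lt;code&gt;google.protobuf.Timestamp&lt;/code&gt; message can be sketched as a plain Python dataclass (the class and its helper are mine for illustration; real protobuf code is generated by &lt;code&gt;protoc&lt;/code&gt;):&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Timestamp:
    # Same shape as google.protobuf.Timestamp: whole seconds since
    # the Unix epoch plus a nanosecond remainder, always in UTC.
    seconds: int
    nanos: int

    def to_datetime(self):
        # Python datetimes only resolve to microseconds, so the
        # nanosecond remainder is truncated here.
        return datetime.fromtimestamp(
            self.seconds + self.nanos / 1_000_000_000, tz=timezone.utc
        )

ts = Timestamp(seconds=946684800, nanos=500_000_000)
print(ts.to_datetime())  # 2000-01-01 00:00:00.500000+00:00
```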

&lt;h3&gt;
  
  
  So what did we learn here?
&lt;/h3&gt;

&lt;p&gt;Support for date types varies from protocol to protocol and depending on where the data comes from, an engineer will pretty much have to deal with every possible scenario. &lt;/p&gt;

&lt;p&gt;So, how can we work with dates and time without shooting ourselves in the foot?&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Examples with DuckDB
&lt;/h2&gt;

&lt;p&gt;Let's see some examples using &lt;a href="https://duckdb.org" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Consider the following CSV data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:07:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:08:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:09:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:10:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The above timestamp is represented in &lt;a href="https://www.rfc-editor.org/rfc/rfc3339" rel="noopener noreferrer"&gt;RFC3339&lt;/a&gt; format.&lt;/p&gt;

&lt;p&gt;Let's see how we can parse this into SQL types using DuckDB. &lt;/p&gt;

&lt;p&gt;DuckDB has great &lt;a href="https://duckdb.org/docs/data/csv" rel="noopener noreferrer"&gt;CSV parsing support&lt;/a&gt;. Assuming our csv file is named &lt;code&gt;events.csv&lt;/code&gt; we execute the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_csv_auto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events.csv'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we get the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xntfy5mfbfan1lyhtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xntfy5mfbfan1lyhtq.png" alt="duckdb-p0" width="712" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is amazing is that DuckDB managed to guess the timestamp type and import the column directly as a timestamp!&lt;/p&gt;

&lt;p&gt;That’s partly because of the amazing work the DuckDB folks have done in delivering the best possible experience, but also because we chose a standard like RFC 3339 for representing dates.&lt;/p&gt;

&lt;p&gt;Now we can move on and keep working with our data. &lt;/p&gt;

&lt;p&gt;Now let’s see how powerful DuckDB is at inferring date formats. To do that, let’s change the date from &lt;code&gt;2023-02-28&lt;/code&gt; to &lt;code&gt;2023-02-02&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason we want to try this is that in the first case it’s easier to infer the format: since there are only 12 months, the 28 must be a day.&lt;/p&gt;

&lt;p&gt;Interestingly enough, DuckDB still infers the date format and creates a timestamp column! &lt;/p&gt;

&lt;p&gt;Just try it using the same code as previously and you will see.&lt;/p&gt;

&lt;p&gt;Finally, if we completely remove the time from the data, DuckDB will infer and use the data type &lt;code&gt;Date&lt;/code&gt; instead of &lt;code&gt;Datetime&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;So, DuckDB is doing a great job in helping us work with date time data! &lt;/p&gt;

&lt;p&gt;Let's update our previous table and add a column that has a Unix Timestamp too.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;th&gt;unix_time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:07:13+00:00&lt;/td&gt;
&lt;td&gt;1677647233&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:08:13+00:00&lt;/td&gt;
&lt;td&gt;1677647293&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:09:13+00:00&lt;/td&gt;
&lt;td&gt;1677647353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:10:13+00:00&lt;/td&gt;
&lt;td&gt;1677647413&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_csv_auto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events_time.csv'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the results we get are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj28da71zqw7etyngrnp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj28da71zqw7etyngrnp4.png" alt="duckdb-dates-1" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, DuckDB doesn't know that &lt;code&gt;unix_time&lt;/code&gt; is a timestamp; it registers the column's data type as int64, which is what we should expect.&lt;/p&gt;

&lt;p&gt;Often we will be handed only Unix timestamps and not dates. How do we deal with that?&lt;/p&gt;

&lt;p&gt;We just need to let DuckDB know what kind of data we are dealing with. &lt;/p&gt;

&lt;p&gt;In this case we know we are dealing with Unix timestamps, and &lt;a href="https://duckdb.org/docs/sql/functions/timestamp" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; has a number of utility functions to help us; more specifically, we will look into the &lt;code&gt;epoch_ms&lt;/code&gt; function. &lt;/p&gt;

&lt;p&gt;Let's execute the following SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result we are getting is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y66je0uh6akj1ebyz2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y66je0uh6akj1ebyz2m.png" alt="duckdb-dates-2" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is obviously wrong: the &lt;code&gt;timestamp&lt;/code&gt; and &lt;code&gt;unix_time&lt;/code&gt; columns should match, and there's a huge difference between events happening in 2023 and in 1970.&lt;/p&gt;

&lt;p&gt;What went wrong? Remember the Unix timestamp definition: it's measured in &lt;strong&gt;seconds&lt;/strong&gt; from the Unix Epoch, but our function is &lt;code&gt;epoch_ms&lt;/code&gt;, which expects &lt;strong&gt;milliseconds&lt;/strong&gt;. &lt;/p&gt;
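&lt;p&gt;You can reproduce the seconds-vs-milliseconds mixup outside DuckDB too. Here is a quick sketch in plain Python (standard library only, with an illustrative epoch value) showing that feeding a seconds-based Unix timestamp to a milliseconds-based conversion lands you back near 1970:&lt;/p&gt;

```python
from datetime import datetime, timezone

unix_seconds = 1677618433  # 2023-02-28T21:07:13 UTC, as seconds since the epoch

# Correct: interpret the value as seconds since the Unix epoch.
as_seconds = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Wrong: a milliseconds-based API effectively divides by 1000 first,
# which is what happened with epoch_ms above.
as_if_milliseconds = datetime.fromtimestamp(unix_seconds / 1000, tz=timezone.utc)

print(as_seconds)          # 2023-02-28 21:07:13+00:00
print(as_if_milliseconds)  # a date in January 1970
```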

&lt;p&gt;Let's try again with a slightly different query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the results look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu594x4ctfkjed8ogs0et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu594x4ctfkjed8ogs0et.png" alt="duckdb-dates-3" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Better, but it still doesn't look right. What's going on here? Let's try something to see if we can figure out the issue. Consider the following query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9edxrf8wh2tqepwx8cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9edxrf8wh2tqepwx8cp.png" alt="duckdb-dates-4" width="770" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's something interesting here: the difference between the two timestamp columns is constant, and it happens to equal the offset between the timezone I live in (Pacific Time) and UTC.&lt;/p&gt;

&lt;p&gt;So, the discrepancy between the two columns comes from the different timezones used. If we want to be consistent, we should take the timezone into account in both cases.&lt;/p&gt;
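&lt;p&gt;To make the timezone effect concrete, here is a small Python sketch (standard library only; Pacific is modeled as a fixed -08:00 offset for simplicity): the same wall-clock reading, anchored to two different zones, denotes instants exactly eight hours apart.&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

wall_clock = datetime(2023, 2, 28, 21, 7, 13)  # naive "2023-02-28T21:07:13"

# The same reading anchored to UTC vs. to a fixed -08:00 (PST) offset.
as_utc = wall_clock.replace(tzinfo=timezone.utc)
as_pacific = wall_clock.replace(tzinfo=timezone(timedelta(hours=-8)))

# Two different instants, exactly eight hours apart.
print(as_pacific - as_utc)  # 8:00:00
```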

&lt;p&gt;&lt;em&gt;Consistent semantics around time is one of the most important best practices when dealing with time in databases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DuckDB has recently added support for working with &lt;a href="https://duckdb.org/docs/extensions/json.html" rel="noopener noreferrer"&gt;JSON documents too&lt;/a&gt;. You should give it a try and repeat the above examples using JSON instead of CSV and see what the differences might be, if any.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Consistent Semantics
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Consistency is key!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keep it consistent: that's the best advice when it comes to working with dates. Stick with one format and make sure everyone follows it.&lt;/p&gt;

&lt;p&gt;Now, that's easier said than done; we will always have bugs and make errors, even when there's strict consensus on how to represent something. &lt;/p&gt;

&lt;h3&gt;
  
  
  First load, then deal with the data
&lt;/h3&gt;

&lt;p&gt;First load your data into your database and then try to figure out if issues exist or not and how to deal with them.&lt;/p&gt;

&lt;p&gt;That doesn't mean you should load the data into production tables, though. On the contrary: load the data into temporary dev tables first, and merge with production once you are sure of the data format.&lt;/p&gt;

&lt;p&gt;Why do that? Because the last thing you want is to fail a whole pipeline because one date is malformed. To summarize: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Separate ingestion and loading logic on your data pipelines. First load into a temp table, then validate and then merge with production tables.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ingestion takes time, especially with big workloads, and much of it goes into IO and converting from one format to another. You don't want to waste those resources on syntactic issues that can be fixed upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate Storage from Presentation
&lt;/h3&gt;

&lt;p&gt;Just because you will be sharing date and time information with humans doesn't mean you should store the data in the format a human expects.&lt;/p&gt;

&lt;p&gt;Instead, stick to timestamps in a standard timezone (most probably UTC), and whenever data has to be presented to the user, let the client figure out the best way to present it. &lt;/p&gt;

&lt;p&gt;By doing that, you ensure clear semantics across all your storage and at the same time maximize flexibility in terms of presenting the information to the consumer.&lt;/p&gt;
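&lt;p&gt;A minimal Python sketch of the idea (the zone and format strings here are illustrative): store one unambiguous UTC instant, and let each client format it at display time.&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

# Storage: one unambiguous instant, always in UTC.
stored = datetime(2023, 2, 28, 21, 7, 13, tzinfo=timezone.utc)

# Presentation: each client converts and formats for its own audience.
pacific = timezone(timedelta(hours=-8), name="PST")  # fixed offset for the sketch
display = stored.astimezone(pacific).strftime("%b %d, %Y %I:%M %p %Z")
print(display)  # Feb 28, 2023 01:07 PM PST
```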

&lt;h3&gt;
  
  
  Learn to trust your database
&lt;/h3&gt;

&lt;p&gt;Don't try to re-invent the wheel. Whatever issue you have with your dates, someone has almost certainly dealt with the same one in the past. &lt;/p&gt;

&lt;p&gt;This knowledge has been integrated into query engines and databases and there are plenty of tools and best practices to follow. &lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links &amp;amp; References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.unixtimestamp.com" rel="noopener noreferrer"&gt;Online Unix Timestamp Converter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/date" rel="noopener noreferrer"&gt;DuckDB Date Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/timestamp" rel="noopener noreferrer"&gt;DuckDB Timestamp Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/data/overview" rel="noopener noreferrer"&gt;DuckDB Data Import documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org" rel="noopener noreferrer"&gt;DuckDB Client Download&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>sql</category>
      <category>tutorial</category>
      <category>database</category>
    </item>
    <item>
      <title>Introduction to SQL Timestamp, Date and Time Data Types.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Tue, 21 Feb 2023 20:49:47 +0000</pubDate>
      <link>https://forem.com/cpard/working-with-sql-timestamp-date-and-time-data-types-3g17</link>
      <guid>https://forem.com/cpard/working-with-sql-timestamp-date-and-time-data-types-3g17</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Working with Time in SQL is one of the most common tasks that people are seeking help for. &lt;br&gt;
In this article I will cover the most common ways to work with time in SQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  SQL Timestamp, Date and Time Data Types
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt" rel="noopener noreferrer"&gt;SQL standard&lt;/a&gt; defines the datetime datatype&lt;br&gt;
which can be further specified using the following descriptors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DATE&lt;/li&gt;
&lt;li&gt;TIME&lt;/li&gt;
&lt;li&gt;TIMESTAMP&lt;/li&gt;
&lt;li&gt;TIME WITH TIME ZONE&lt;/li&gt;
&lt;li&gt;TIMESTAMP WITH TIME ZONE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together with the value of the time fractional seconds precision if the descriptor is one of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TIME&lt;/li&gt;
&lt;li&gt;TIMESTAMP&lt;/li&gt;
&lt;li&gt;TIME WITH TIME ZONE&lt;/li&gt;
&lt;li&gt;TIMESTAMP WITH TIME ZONE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL also defines intervals in a similar way, but there's only one possible type descriptor in this case: INTERVAL.&lt;/p&gt;

&lt;p&gt;Together with an indication of whether the interval is a year-month or a day-time interval, and finally the interval qualifier that describes&lt;br&gt;
the precision of the interval data type.&lt;/p&gt;
&lt;h2&gt;
  
  
  Things to Remember
&lt;/h2&gt;

&lt;p&gt;👉 Datetimes only have absolute meaning in the context of additional information. &lt;/p&gt;

&lt;p&gt;So, unless the time zone specifier, and its meaning, is known, the meaning of a datetime value is ambiguous.&lt;/p&gt;

&lt;p&gt;Therefore, datetime data types that contain time fields (TIME and TIMESTAMP) are maintained in Universal Coordinated Time (UTC), with an explicit or implied time zone part.&lt;/p&gt;

&lt;p&gt;👉 The time zone part is an interval specifying the difference between&lt;br&gt;
         UTC and the actual date and time in the time zone represented by&lt;br&gt;
         the time or timestamp data item.&lt;/p&gt;

&lt;p&gt;👉 Items of type datetime are mutually comparable only if they have&lt;br&gt;
         the same datetime fields.&lt;/p&gt;

&lt;p&gt;👉 There is an ordering of the significance of datetime fields. This&lt;br&gt;
         is, from most significant to least significant: YEAR, MONTH, DAY,&lt;br&gt;
         HOUR, MINUTE, and SECOND.&lt;/p&gt;

&lt;p&gt;OK, that's all you need to know! (Joking, there's more.) But the above statements are helpful to keep in mind&lt;br&gt;
when working with time in SQL.&lt;/p&gt;
&lt;h2&gt;
  
  
  Operating on Datetime types
&lt;/h2&gt;

&lt;p&gt;The basic arithmetic operators are supported for Datetime type, although it's always a good idea&lt;br&gt;
to check the documentation of the database you are using to check the exact semantics implemented for the operator.&lt;/p&gt;

&lt;p&gt;The following operators are the most commonly found ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Addition (+)

&lt;ul&gt;
&lt;li&gt;date + integer&lt;/li&gt;
&lt;li&gt;date + interval&lt;/li&gt;
&lt;li&gt;date + time &lt;/li&gt;
&lt;li&gt;interval + interval &lt;/li&gt;
&lt;li&gt;timestamp + interval &lt;/li&gt;
&lt;li&gt;time + interval &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Subtraction (-)

&lt;ul&gt;
&lt;li&gt;-interval (Negation)&lt;/li&gt;
&lt;li&gt;date - date&lt;/li&gt;
&lt;li&gt;date - integer&lt;/li&gt;
&lt;li&gt;date - interval&lt;/li&gt;
&lt;li&gt;time - time &lt;/li&gt;
&lt;li&gt;time - interval &lt;/li&gt;
&lt;li&gt;timestamp - interval&lt;/li&gt;
&lt;li&gt;interval - interval&lt;/li&gt;
&lt;li&gt;timestamp - timestamp&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Multiplication (*)

&lt;ul&gt;
&lt;li&gt;interval * double precision&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Division (/)

&lt;ul&gt;
&lt;li&gt;interval / double precision
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, the most important thing is to check with the database documentation on the semantics of the above operations.&lt;/p&gt;

&lt;p&gt;For example, what should be the return type of each operation?&lt;/p&gt;
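&lt;p&gt;Python's standard library offers a reasonable mental model for these operators (a sketch, not any particular database's exact behavior): intervals map to timedelta, and the return type follows the operand types.&lt;/p&gt;

```python
from datetime import date, datetime, timedelta

# date - date yields an interval.
assert date(2023, 3, 1) - date(2023, 2, 28) == timedelta(days=1)

# timestamp + interval yields a timestamp.
ts = datetime(2023, 2, 28, 21, 7, 13)
assert ts + timedelta(minutes=1) == datetime(2023, 2, 28, 21, 8, 13)

# interval * number yields a scaled interval.
assert timedelta(hours=1) * 2.5 == timedelta(hours=2, minutes=30)
```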
&lt;h2&gt;
  
  
  Date and Time Functions
&lt;/h2&gt;

&lt;p&gt;There are some functions that are more important than others, or at least that you are going to need more often.&lt;br&gt;
These are:&lt;/p&gt;

&lt;p&gt;👉 First the popular &lt;code&gt;date_trunc(field, source)&lt;/code&gt; function, where source is a value expression of a datetime-related type and field is used &lt;br&gt;
to define the precision of the truncation. Some examples of valid field values from Postgres are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;microseconds&lt;/li&gt;
&lt;li&gt;milliseconds&lt;/li&gt;
&lt;li&gt;second&lt;/li&gt;
&lt;li&gt;minute&lt;/li&gt;
&lt;li&gt;hour&lt;/li&gt;
&lt;li&gt;day&lt;/li&gt;
&lt;li&gt;week&lt;/li&gt;
&lt;li&gt;month&lt;/li&gt;
&lt;li&gt;quarter&lt;/li&gt;
&lt;li&gt;year&lt;/li&gt;
&lt;li&gt;decade&lt;/li&gt;
&lt;li&gt;century&lt;/li&gt;
&lt;li&gt;millennium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As usual, check your documentation!&lt;/p&gt;
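&lt;p&gt;As a mental model for truncation, here is a plain-Python sketch (not any database's actual implementation): truncating to a field means zeroing out every less significant field.&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2023, 2, 28, 21, 7, 13)

# date_trunc('hour', ts): zero everything below the hour.
to_hour = ts.replace(minute=0, second=0, microsecond=0)

# date_trunc('month', ts): reset the day and the whole time part.
to_month = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

print(to_hour)   # 2023-02-28 21:00:00
print(to_month)  # 2023-02-01 00:00:00
```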

&lt;p&gt;👉 Then we have &lt;code&gt;now()&lt;/code&gt; which returns the current timestamp, usually as of the beginning of the query execution.&lt;br&gt;
There are also a number of other similar functions, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;current_date&lt;/li&gt;
&lt;li&gt;current_time&lt;/li&gt;
&lt;li&gt;current_timestamp&lt;/li&gt;
&lt;li&gt;current_time(precision)&lt;/li&gt;
&lt;li&gt;and many more...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 The &lt;code&gt;extract(field from source)&lt;/code&gt; function is used to retrieve subfields such as year or hour from date/time values. &lt;br&gt;
The extract function is primarily intended for computational processing.&lt;/p&gt;

&lt;p&gt;👉 The &lt;code&gt;date_part(field, source)&lt;/code&gt; function extracts a part of the date, timestamp, or interval.&lt;/p&gt;

&lt;p&gt;👉 &lt;code&gt;date_format(expr, fmt)&lt;/code&gt; is used to format a date time value using the format specified by fmt. &lt;/p&gt;

&lt;p&gt;Formatting is a bit of a complicated matter as there's a lot of flexibility on how a date can be formatted. &lt;/p&gt;

&lt;p&gt;There's a "language" that is used to define patterns for the formatting. This language is at least similar&lt;br&gt;
among the different databases and as I've said many times, you should check the documentation! &lt;/p&gt;

&lt;p&gt;👉 Finally, there are the casting expressions like &lt;code&gt;cast(expr as type)&lt;/code&gt; that are used to cast between different datetime types and others.&lt;br&gt;
For example casting a string into a date or timestamp or a datetime to timestamp etc. &lt;/p&gt;

&lt;p&gt;The above are some of the most commonly used functions and operators when working with time in SQL.&lt;br&gt;
Most database systems offer many more for convenience but the above should be enough to cover most of the use cases.&lt;/p&gt;
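&lt;p&gt;The same family of operations, sketched with Python's standard library for comparison (format pattern codes differ between databases; these are strftime codes):&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2023, 2, 28, 21, 7, 13)

# extract / date_part: pull a subfield out of the value.
year = ts.year  # 2023

# date_format: render with an explicit pattern.
formatted = ts.strftime("%Y-%m-%d %H:%M")  # '2023-02-28 21:07'

# cast(string as timestamp): parse with an explicit pattern.
parsed = datetime.strptime("2023-02-28T21:07:13", "%Y-%m-%dT%H:%M:%S")
assert parsed == ts
```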
&lt;h2&gt;
  
  
  Windows
&lt;/h2&gt;

&lt;p&gt;This is where things start getting more complicated when working with dates and times in SQL.&lt;/p&gt;

&lt;p&gt;Windows are important as they allow you to define partitions over your data and perform aggregations over them.&lt;/p&gt;

&lt;p&gt;Consider for example the case of calculating Monthly Recurring Revenue (MRR). To do that, you first &lt;br&gt;
need to partition your data by month and then calculate the sum of the revenue for each month.&lt;/p&gt;

&lt;p&gt;The most common way of defining a window is using the &lt;code&gt;OVER&lt;/code&gt; clause. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mrr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query will calculate the MRR for each month, and it's a good example of how powerful the combination&lt;br&gt;
of windows and datetime functions is.&lt;/p&gt;
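&lt;p&gt;If you want to play with this locally without a warehouse, SQLite (version 3.25+, bundled with Python's standard library) supports window functions too. Note that date_trunc doesn't exist there, so this sketch stands in with strftime('%Y-%m', ...); the table and values are made up:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (date TEXT, revenue INTEGER)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2023-01-10", 100), ("2023-01-25", 50), ("2023-02-05", 200)],
)

# date_trunc('month', date) becomes strftime('%Y-%m', date) in SQLite.
rows = con.execute("""
    SELECT strftime('%Y-%m', date) AS month,
           SUM(revenue) OVER (PARTITION BY strftime('%Y-%m', date)) AS mrr
    FROM transactions
    ORDER BY month
""").fetchall()
print(rows)  # [('2023-01', 150), ('2023-01', 150), ('2023-02', 200)]
```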

&lt;p&gt;There are other types of window functions available, but they can be grouped into two main categories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking windows&lt;/li&gt;
&lt;li&gt;Aggregate (or analytic) windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common example of a ranking window function is &lt;code&gt;rank()&lt;/code&gt; which returns the rank of a value compared to all values in the partition.&lt;/p&gt;

&lt;p&gt;A common example of an analytic window function is &lt;code&gt;lag(expr)&lt;/code&gt;, which returns the value of expr from a preceding row within the partition.&lt;/p&gt;

&lt;p&gt;Again, you will find examples and more details in the documentation of your database system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Working with datetime in SQL can easily get complicated, especially when we start working with &lt;br&gt;
formatting and windows. &lt;/p&gt;

&lt;p&gt;But the most common tasks associated with time in SQL are not that complicated.&lt;br&gt;
There's a small number of data types and a large number of helper functions that can help a lot.&lt;/p&gt;

&lt;p&gt;Below you can find a list of links to the documentation of some well known OLAP and OLTP databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#date-timestamp-and-interval-functions" rel="noopener noreferrer"&gt;Databricks - Spark Date-Time &amp;amp; Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#ranking-window-functions" rel="noopener noreferrer"&gt;Databricks - Spark Windows (ranking)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#analytic-window-functions" rel="noopener noreferrer"&gt;Databricks - Spark Windows (analytics)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/sql-reference/data-types-datetime.html" rel="noopener noreferrer"&gt;Snowflake Types&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/functions-window-using.html" rel="noopener noreferrer"&gt;Snowflake Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT" rel="noopener noreferrer"&gt;PostgreSQL Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/15/tutorial-window.html" rel="noopener noreferrer"&gt;PostgreSQL Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions/" rel="noopener noreferrer"&gt;Clickhouse Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/time-window-functions/" rel="noopener noreferrer"&gt;Clickhouse Windows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Adding a new data type to a sql query engine (Trino)</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 06 Feb 2023 04:00:18 +0000</pubDate>
      <link>https://forem.com/cpard/adding-a-new-data-type-to-a-sql-query-engine-trino-5f85</link>
      <guid>https://forem.com/cpard/adding-a-new-data-type-to-a-sql-query-engine-trino-5f85</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post I want to take you through a journey that starts with &lt;a href="https://github.com/trinodb/trino/issues/1284" rel="noopener noreferrer"&gt;Github Issue #1284&lt;/a&gt; requesting support for nanosecond/microsecond precision in TIMESTAMP for Trino and ends with the merge of &lt;a href="https://github.com/trinodb/trino/pull/3783" rel="noopener noreferrer"&gt;Github PR #3783&lt;/a&gt; that added support for the parametric TIMESTAMP type to the Trino query engine.&lt;/p&gt;

&lt;p&gt;This journey includes a number of surprises too!&lt;/p&gt;

&lt;p&gt;For example, the identification of issues with the semantics of the TIMESTAMP type in Trino, explained in a lot of detail in &lt;a href="https://github.com/trinodb/trino/issues/37" rel="noopener noreferrer"&gt;Github Issue #37&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our goal is to go deep into some important dimensions of a SQL query engine, including the type system and data encoding.&lt;/p&gt;

&lt;p&gt;But also get a taste of what it takes to engineer a complex system like a distributed SQL query engine that is being used daily by thousands of users.&lt;/p&gt;

&lt;p&gt;This is going to be a long post, so buckle up!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When working with time in a digital system like a computer, precision is one of the things that we deeply care about. Especially when time is important for the tasks we want to perform.&lt;/p&gt;

&lt;p&gt;Some industries, like finance, are more sensitive to time&lt;br&gt;
measurements than others, but in any case we want to understand well the semantics of the time data types we work with.&lt;/p&gt;

&lt;p&gt;In 2020 FINRA, the Financial Industry Regulatory Authority, submitted a proposal to the SEC to change the rules relating to the granularity of timestamps in trade reports.&lt;/p&gt;

&lt;p&gt;The proposal suggests tracking trades at nanosecond time granularity.&lt;/p&gt;

&lt;p&gt;A system that supports nanosecond timestamp precision,&lt;br&gt;
can work with timestamps in millisecond precision without losing any information. The opposite is not true though.&lt;/p&gt;

&lt;p&gt;Trino can support up to &lt;a href="https://trino.io/docs/current/language/types.html#date-and-time" rel="noopener noreferrer"&gt;picosecond timestamp precision&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Trino is a technology used heavily by the financial sector.&lt;br&gt;
This is one of the &lt;a href="https://github.com/trinodb/trino/issues/1284" rel="noopener noreferrer"&gt;reasons&lt;/a&gt; the Trino community started looking into updating the TIMESTAMP type in Trino.&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL Data Types
&lt;/h2&gt;

&lt;p&gt;A data type is a set of representable values and the physical representation of any of the values, is implementation-dependent.&lt;/p&gt;

&lt;p&gt;The value of a Timestamp might be 1663802507 but how it is physically represented is left to the database engineer to decide.&lt;/p&gt;

&lt;p&gt;Usually, Datetime types (including Timestamp) are physically implemented using 32/64-bit integers.&lt;/p&gt;

&lt;p&gt;The SQL Specification allows for arbitrary precision of Datetime types and does that by stating the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...SECOND, however,&lt;br&gt;
can be defined to have a &amp;lt;time fractional seconds precision&amp;gt; that indicates the number of decimal digits maintained&lt;br&gt;
following the decimal point in the seconds value, a non-negative exact numeric value.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Parameterized Data Types
&lt;/h3&gt;

&lt;p&gt;If you are familiar with SQL you will easily recognize the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;orderkey&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above will create a table with a column named orderkey whose type will be bigint. There's no way to parameterize bigint.&lt;/p&gt;

&lt;p&gt;This is true for the majority of types, but there are a couple of exceptions, with timestamps being one of them.&lt;/p&gt;

&lt;p&gt;For reasons that we will investigate later, it does make a difference when you are dealing with a timestamp in milliseconds versus picoseconds.&lt;/p&gt;

&lt;p&gt;We'd like to allow the user to define the granularity and take advantage of that in optimizing how the data is manipulated.&lt;/p&gt;

&lt;p&gt;In Trino the Timestamp type looks like Timestamp(p),&lt;br&gt;
where p is the number of digits of precision for the fraction of seconds.&lt;/p&gt;

&lt;p&gt;The following SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;orderkey&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;creationdate&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will create a table orders with a column named creationdate of type timestamp, with 6 digits of precision for the fraction of seconds.&lt;/p&gt;

&lt;h1&gt;
  
  
  Adding a new Data type
&lt;/h1&gt;

&lt;p&gt;Adding a new type, or in this case updating an existing one with new semantics, is a risky task.&lt;/p&gt;

&lt;p&gt;We are dealing with a system that is used by thousands of users daily, and any change to a type might end up breaking literally thousands of SQL queries.&lt;/p&gt;

&lt;p&gt;Let's see some of the considerations that the Trino team had during the design phase and then we will elaborate more on &lt;br&gt;
each one of them.&lt;/p&gt;

&lt;p&gt;First we have performance considerations.&lt;br&gt;
How can we make sure that we deliver the best possible performance when dealing with timestamps?&lt;br&gt;
The trick here is to consider different representations and functions based on the precision the user has defined.&lt;/p&gt;

&lt;p&gt;There's the major issue of backward compatibility.&lt;/p&gt;

&lt;p&gt;How do we ensure that the Timestamp semantics of the current implementation will not break with the introduction of parameterization?&lt;/p&gt;

&lt;p&gt;This compatibility is not just for the type itself but also for all the functions that are related to this type.&lt;/p&gt;

&lt;p&gt;Then we have the added complexity of Trino being a federated query engine.&lt;/p&gt;

&lt;p&gt;We need to make sure that the new Timestamp type can be correctly mapped to the equivalent types of each connector supported and do that in the most performant way possible.&lt;/p&gt;

&lt;p&gt;Finally, we need to make sure that we handle correctly data type conversions between different types and precisions.&lt;/p&gt;

&lt;p&gt;Each one of the above considerations is a huge fascinating topic on its own, so let's dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  From Logical to Physical representation
&lt;/h3&gt;

&lt;p&gt;We physically implement timestamps using integers.&lt;br&gt;
Obviously a 32-bit integer can represent more timestamps than an 8-bit one.&lt;/p&gt;

&lt;p&gt;So how many bits do we need for picosecond precision?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/wiki/Variable-precision-datetime-types" rel="noopener noreferrer"&gt;The answer can be found in the design document for the variable precision datetime types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;p=12&lt;/code&gt; we need 79 bits, which is a bit more than the 64 bits that the &lt;a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html" rel="noopener noreferrer"&gt;Java long type supports&lt;/a&gt;. To represent timestamps with resolution higher than &lt;code&gt;p=6&lt;/code&gt;, we will need to come up with our own Java representation.&lt;/p&gt;
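&lt;p&gt;As a sanity check, here is a back-of-the-envelope sketch of my own (not Trino code) that estimates how many bits a counter of fractional-second units needs at precision &lt;code&gt;p&lt;/code&gt;, assuming, for illustration, roughly 10,000 years of representable range:&lt;/p&gt;

```java
// Back-of-the-envelope sketch (not Trino code): bits needed to count
// fractional-second units at precision p over an assumed 10,000-year range.
public class PrecisionBits {
    static int bitsNeeded(int p) {
        double seconds = 10_000.0 * 365.25 * 24 * 3600;  // assumed range in seconds
        double units = seconds * Math.pow(10, p);        // units of 10^-p seconds
        return (int) Math.ceil(Math.log(units) / Math.log(2));
    }

    public static void main(String[] args) {
        for (int p = 0; p != 13; p++) {
            System.out.println("p=" + p + " needs about " + bitsNeeded(p) + " bits");
        }
    }
}
```

&lt;p&gt;Under this assumption, &lt;code&gt;p=6&lt;/code&gt; needs about 59 bits and fits comfortably in a Java long, while &lt;code&gt;p=12&lt;/code&gt; needs about 79 bits, which is why a single 64-bit word is no longer enough.&lt;/p&gt;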

&lt;p&gt;What Trino does is the following:&lt;/p&gt;

&lt;p&gt;We will have one short encoding for any timestamp precision that can be represented by a Java long type. This covers &lt;code&gt;p&amp;lt;=6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will also introduce a long encoding for any timestamp precision &lt;code&gt;p &amp;gt; 6&lt;/code&gt; and up to the maximum we want to support, which in this case is 12. &lt;/p&gt;

&lt;p&gt;The fractional part will be 16 or 32 bits, depending on what precision we want to support in the end.&lt;/p&gt;
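&lt;p&gt;Put together, the two encodings can be sketched like this (illustrative names and shapes of my own, not Trino's actual classes):&lt;/p&gt;

```java
// Illustrative sketch of the two physical layouts (names are mine, not Trino's):
// a short timestamp is a single Java long, while a long timestamp pairs a
// 64-bit whole part with a 32-bit fractional remainder, 64 + 32 = 96 bits.
public class TimestampLayouts {
    // p up to 6: the whole value fits in one 64-bit long
    static long shortTimestamp(long epochMicros) {
        return epochMicros;
    }

    // p from 7 to 12: a long for the whole part plus an int for the extra digits
    static long[] longTimestamp(long epochMicros, int picosOfMicro) {
        return new long[] { epochMicros, picosOfMicro };
    }

    public static void main(String[] args) {
        long[] t = longTimestamp(1_700_000_000_000_000L, 123_456);
        System.out.println(t[0] + " micros plus " + t[1] + " picos");
    }
}
```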

&lt;p&gt;How does this look when implemented?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestamp.java" rel="noopener noreferrer"&gt;Let's see the current Trino implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LongTimestamp class implements the type described above: one Java long holds the whole part of the timestamp, and one Java int (a 32-bit type) represents the remaining fractional part of a second, with precision up to 32 bits.&lt;/p&gt;

&lt;p&gt;There's also an implementation for a Short Timestamp.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;Let's see how this looks in Trino&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We want the Short Timestamp to be represented by a 64-bit integer, and to do that, we explicitly associate this class with the Java &lt;code&gt;long.class&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Most of the implementation has been omitted to make it easy to understand how Short Timestamp is represented, but the writeLong method is a good place to see how the type is assumed to be a long.&lt;/p&gt;

&lt;p&gt;You might be wondering why we make our lives so hard instead of simplifying things with a single physical Timestamp representation.&lt;/p&gt;

&lt;p&gt;The answer is performance. Trino is one of the most performant SQL query engines in the industry, and this wouldn't have happened if we didn't squeeze out every little bit of performance we could.&lt;/p&gt;

&lt;p&gt;When we talk about performance, there are two kinds we care about. One is storage performance, i.e. how we can reduce the space required to store information.&lt;/p&gt;

&lt;p&gt;The other one is processing time performance, or how we can make things run faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saving Space &amp;amp; Time
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestampType.java" rel="noopener noreferrer"&gt;If we take a look at the LongTimestampType implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trino uses a BlockBuilder to write and read data for the specific type.&lt;br&gt;
What is interesting in that code is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The size of the type is used as information passed to the BlockBuilder&lt;/li&gt;
&lt;li&gt;There's a special BlockBuilder implementation for Int96; as we concluded earlier, we need 64 + 32 = 96 bits to represent our Long Timestamps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we take a look at the &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;ShortTimestampType implementation&lt;/a&gt;, we'll notice that the BlockBuilder used is different.&lt;/p&gt;

&lt;p&gt;The API used is the same, but the BlockBuilder is different: now we use a LongArrayBlockBuilder instead of an Int96ArrayBlockBuilder.&lt;/p&gt;

&lt;p&gt;The Block part of the Trino SPI is a very important part of the query engine, as it is responsible for the efficient encoding of the data being processed.&lt;/p&gt;

&lt;p&gt;Any decision in the Block API can greatly affect the performance of the engine.&lt;/p&gt;

&lt;p&gt;Now you might wonder why I went through all these examples and why the Trino engineers went through the implementation of different block writers.&lt;/p&gt;

&lt;p&gt;The reason is simple: it has to do with the space efficiency of the data types.&lt;/p&gt;

&lt;p&gt;Remember that a ShortTimestampType is represented by a 64-bit Java long, while a LongTimestampType uses a 64-bit Java long plus a 32-bit Java int. Based on that, the memory required to store a timestamp of each type is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShortTimestampType: 8 bytes&lt;/li&gt;
&lt;li&gt;LongTimestampType: 16 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need twice the space for a timestamp with precision higher than microseconds.&lt;br&gt;
Keep in mind that if we use a LongTimestampType, we will use 16 bytes of memory even if we don't have picosecond precision.&lt;/p&gt;

&lt;p&gt;That's 2x the memory, and it's a lot when we are working with petabytes of data!&lt;/p&gt;

&lt;p&gt;Let's see now what happens with time complexity and if there's any performance difference for typical type operations like comparison between the two timestamp types.&lt;/p&gt;

&lt;p&gt;To do that, we will need the implementations of &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;ShortTimestampType&lt;/a&gt; and &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestampType.java" rel="noopener noreferrer"&gt;LongTimestampType&lt;/a&gt;; let's check the difference in the comparison operator.&lt;/p&gt;

&lt;p&gt;The ShortTimestampType comparison involves a single call to the Long.compare method.&lt;/p&gt;

&lt;p&gt;On the other hand, the LongTimestampType comparison may involve a second step, so in the worst case we have to compare one long and one int value.&lt;/p&gt;
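&lt;p&gt;The shape of the two code paths can be sketched like this (hypothetical names, simplified from the linked classes):&lt;/p&gt;

```java
// Sketch of the two comparison paths (hypothetical names, not Trino's code).
public class TimestampComparisons {
    // Short timestamps: a single 64-bit comparison.
    static int compareShort(long leftMicros, long rightMicros) {
        return Long.compare(leftMicros, rightMicros);
    }

    // Long timestamps: compare the 64-bit part first and fall back to the
    // 32-bit fractional part only on a tie, so the common case stays cheap.
    static int compareLong(long leftMicros, int leftPicos, long rightMicros, int rightPicos) {
        int result = Long.compare(leftMicros, rightMicros);
        if (result != 0) {
            return result;
        }
        return Integer.compare(leftPicos, rightPicos);
    }

    public static void main(String[] args) {
        // Equal micros, so the fractional parts decide the ordering.
        System.out.println(compareLong(10, 5, 10, 7));
    }
}
```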

&lt;p&gt;There are substantial performance gains, both in time and storage, from adopting this more complex, dual-encoding timestamp implementation.&lt;/p&gt;

&lt;p&gt;Decisions like this one, made in every part of the engine, are what make Trino such a performant query engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move Fast but Don't Break Things
&lt;/h2&gt;

&lt;p&gt;Trino is used by thousands of users on a daily basis, so whenever something new is introduced, it's super important to ensure that nothing will break.&lt;/p&gt;

&lt;p&gt;Backward compatibility is important and it's part of the design of any new feature.&lt;/p&gt;

&lt;p&gt;To ensure backward compatibility, it was decided to first tackle the language and the data types by maintaining the current semantics while adding parameterization.&lt;/p&gt;

&lt;p&gt;By current semantics we mean the precision that was supported by Trino at that time, which was &lt;code&gt;p=3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Updating the types alone is not enough, though; there are a number of special functions that have to be parameterized as well.&lt;/p&gt;

&lt;p&gt;These functions are &lt;code&gt;current_time&lt;/code&gt;, &lt;code&gt;localtime&lt;/code&gt;, &lt;code&gt;current_timestamp&lt;/code&gt; and &lt;code&gt;localtimestamp&lt;/code&gt;, where the user will be able to provide a precision parameter while the current semantics remain the defaults.&lt;/p&gt;

&lt;p&gt;Along with the functions mentioned above, all the connectors that accept the timestamp types in their DDL will have to be updated, together with all functions and operators that operate on these types.&lt;/p&gt;

&lt;p&gt;This is just the first step in building support for parameterized timestamp types, and its purpose is to introduce the appropriate changes without breaking anything for existing users, by enforcing the right defaults.&lt;/p&gt;

&lt;p&gt;We also need to ensure that there's a way to safely cast a Timestamp with precision &lt;code&gt;p1&lt;/code&gt; to a Timestamp with precision &lt;code&gt;p2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To do that, we need to implement logic for truncation and rounding.&lt;/p&gt;

&lt;p&gt;The logic for these operations can be found in the implementation of the &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/Timestamps.java" rel="noopener noreferrer"&gt;Timestamps&lt;/a&gt; class.&lt;/p&gt;

&lt;p&gt;That class implements, among other things, the logic for truncating micros to millis and for rounding micros to millis.&lt;/p&gt;
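&lt;p&gt;The core idea behind those two operations can be sketched as follows (a minimal approximation of the logic, not the actual methods from the Timestamps class):&lt;/p&gt;

```java
// Minimal sketch of precision reduction (assumed logic, not Trino's exact
// methods): casting microseconds (p=6) down to milliseconds (p=3).
public class PrecisionCast {
    // Truncation simply drops the extra fractional digits.
    static long truncateMicrosToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros, 1000);
    }

    // Rounding adds half a millisecond before flooring, so .5 rounds up.
    static long roundMicrosToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros + 500, 1000);
    }

    public static void main(String[] args) {
        System.out.println(truncateMicrosToMillis(1_999_999)); // prints 1999
        System.out.println(roundMicrosToMillis(1_999_999));    // prints 2000
    }
}
```

&lt;p&gt;Negative epoch values (times before 1970) are why &lt;code&gt;Math.floorDiv&lt;/code&gt; is used here instead of plain integer division, which truncates toward zero.&lt;/p&gt;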

&lt;h2&gt;
  
  
  Time is Relative
&lt;/h2&gt;

&lt;p&gt;Trino is a federated query engine, which means that a query executed on it can be pushed down to different underlying systems.&lt;/p&gt;

&lt;p&gt;These systems usually do not share the same semantics. This is especially true for date/time types.&lt;/p&gt;

&lt;p&gt;For Trino to consider a new feature for connectors, it first has to be implemented and released for Hive.&lt;/p&gt;

&lt;p&gt;Hive is important not only because it is heavily used together with Trino, but also because it acts as the reference implementation for any other connector.&lt;/p&gt;

&lt;p&gt;The first step was to add support for &lt;a href="https://github.com/trinodb/trino/issues/3977" rel="noopener noreferrer"&gt;variable-precision timestamp&lt;/a&gt; type for Hive connector. You might notice on this issue that the Hive metastore supports timestamps with optional &lt;a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-TimestampstimestampTimestamps" rel="noopener noreferrer"&gt;nanosecond precision&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What that means is that although Trino can handle up to picosecond time resolution, when working with Hive, we can only use up to nanosecond.&lt;/p&gt;

&lt;p&gt;This is a common pattern, and different datastores will have different restrictions; for example, the Postgres connector can handle up to microsecond resolution.&lt;/p&gt;

&lt;p&gt;That means that the person who implements support for the new Timestamp parameters for the Postgres connector will have to account for this and ensure that the right type casts happen.&lt;/p&gt;
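&lt;p&gt;Conceptually, such a connector has to clamp incoming values to its maximum precision, along these lines (a hypothetical helper of my own, not the actual connector code):&lt;/p&gt;

```java
// Hypothetical illustration of per-connector precision clamping: a connector
// that only supports microseconds (like Postgres) must round away the extra
// fractional digits of a higher-precision timestamp.
public class ConnectorPrecision {
    // Round a picosecond-precision fraction (p=12) to a coarser target precision.
    static long clampPicos(long picos, int targetPrecision) {
        long divisor = 1;
        for (int i = targetPrecision; i != 12; i++) {
            divisor = divisor * 10;              // builds 10^(12 - targetPrecision)
        }
        return Math.floorDiv(picos + divisor / 2, divisor) * divisor;
    }

    public static void main(String[] args) {
        // Round 123_456_789_999 picoseconds to microsecond (p=6) resolution.
        System.out.println(clampPicos(123_456_789_999L, 6)); // prints 123457000000
    }
}
```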

&lt;p&gt;After Hive was supported, a number of other connectors and clients followed, including updates to the CLI, the JDBC driver, and the Python client. These are some of the most frequently used connectors and clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are always trade-offs when working with temporal data; in most cases, you will have to change the way you work with time multiple times while modeling and analyzing your data.&lt;/p&gt;

&lt;p&gt;The responsibility of the query engine is to provide you with all the tools you need to do that while ensuring the soundness and correctness of data processing.&lt;/p&gt;

&lt;p&gt;Now that you have a basic understanding of how types work in Trino, I encourage you to take a deeper look into the codebase and dig deeper into how these types are serialized and then processed.&lt;/p&gt;

</description>
      <category>portfolio</category>
      <category>howto</category>
      <category>tooling</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
