<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kostas Pardalis</title>
    <description>The latest articles on Forem by Kostas Pardalis (@cpard).</description>
    <link>https://forem.com/cpard</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017586%2F28f2d97f-7a4a-493f-9064-066401c66633.jpeg</url>
      <title>Forem: Kostas Pardalis</title>
      <link>https://forem.com/cpard</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cpard"/>
    <language>en</language>
    <item>
      <title>How to Build a Deep Research Agent with Pydantic AI</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 17 Nov 2025 20:11:04 +0000</pubDate>
      <link>https://forem.com/cpard/how-to-build-a-deep-research-agent-with-pydantic-ai-3ogf</link>
      <guid>https://forem.com/cpard/how-to-build-a-deep-research-agent-with-pydantic-ai-3ogf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"HN is amazing for discovery, terrible for structured research."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you hang out on Hacker News, you know the feeling: you see a great thread, think &lt;em&gt;"I should come back to this"&lt;/em&gt;, and… never do. A week later, you're trying to answer a question like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"How has HN's opinion on Rust vs Go changed over time?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What does HN actually think about LangChain-style agent frameworks?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HN's built-in search is fine for keywords, but not for &lt;strong&gt;questions about themes, opinions, and trends&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What we really want is to ask higher-level questions about topics, threads, and time windows like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me discussions about &lt;code&gt;e.g. Rust&lt;/code&gt; in the last 6 months."&lt;/li&gt;
&lt;li&gt;"Compare how &lt;code&gt;remote work&lt;/code&gt; was discussed in 2021 vs 2024."&lt;/li&gt;
&lt;li&gt;"Summarize the main arguments for and against &lt;code&gt;LLM agents&lt;/code&gt; across top HN threads."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's where &lt;strong&gt;&lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;fenic&lt;/a&gt;&lt;/strong&gt; comes in: think of it as a &lt;strong&gt;dataframe + context layer&lt;/strong&gt; built for LLM-powered analysis. You declare what data you care about, use regular + semantic transforms to shape it, and then plug that into an agent loop.&lt;/p&gt;

&lt;p&gt;This post walks through how we use fenic to turn a raw Hacker News dataset into a small but powerful &lt;strong&gt;"deep research" agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Full project:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fenic repo: &lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Example code: &lt;a href="https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HN dataset: &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;2. What we'll build&lt;/h2&gt;

&lt;p&gt;We'll build a small research agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads an &lt;strong&gt;HN dataset&lt;/strong&gt; (stories, comments, metadata).&lt;/li&gt;
&lt;li&gt;Lets you &lt;strong&gt;filter and slice discussions&lt;/strong&gt; by topic, time, and signals.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;LLMs to summarize, compare, and extract themes&lt;/strong&gt; from those slices.&lt;/li&gt;
&lt;li&gt;Wraps it all in a &lt;strong&gt;simple loop&lt;/strong&gt;:
&lt;em&gt;user question → fenic dataframe query → LLM analysis → answer + links back to HN.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the pipeline looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data layer:&lt;/strong&gt; fenic DataFrames over the HN dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query layer:&lt;/strong&gt; reusable "research queries" expressed as fenic transformations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM layer:&lt;/strong&gt; fenic semantic operators (and/or UDFs) to summarize/compare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent loop:&lt;/strong&gt; something like PydanticAI or your framework of choice to orchestrate.&lt;/li&gt;
&lt;/ol&gt;
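&lt;p&gt;Put together, those four layers reduce to a small loop. Here's a minimal, framework-agnostic sketch; the &lt;code&gt;query_slice&lt;/code&gt; and &lt;code&gt;summarize_slice&lt;/code&gt; callables are placeholders standing in for the fenic query layer and the LLM layer, not the project's actual API:&lt;/p&gt;

```python
def run_research_loop(question, query_slice, summarize_slice):
    """user question -> dataframe query -> LLM analysis -> answer + links."""
    rows = query_slice(question)           # query layer: slice the HN data
    summary = summarize_slice(rows)        # LLM layer: summarize/compare the slice
    links = [row["url"] for row in rows]   # answer links back to the HN threads
    return {"answer": summary, "links": links}
```

&lt;p&gt;An agent framework like Pydantic AI replaces the hard-wired sequence with tool calls the model chooses itself, but the data flow is the same.&lt;/p&gt;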

&lt;h2&gt;3. Setting up fenic&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;An LLM provider key (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;uv&lt;/a&gt; package manager&lt;/li&gt;
&lt;li&gt;&lt;code&gt;git&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Install with uv&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the example repo&lt;/span&gt;
git clone https://github.com/typedef-ai/fenic-examples.git
&lt;span class="nb"&gt;cd &lt;/span&gt;fenic-examples/hn_agent

&lt;span class="c"&gt;# Install dependencies with uv&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your LLM API key(s) and HuggingFace token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-huggingface-token"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this folder you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A notebook / script that wires together fenic + PydanticAI.&lt;/li&gt;
&lt;li&gt;Helper functions to load the Hacker News dataset.&lt;/li&gt;
&lt;li&gt;A simple agent loop that you can run locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you just want to &lt;strong&gt;run it and poke around&lt;/strong&gt;, start there. The rest of this post explains the pieces.&lt;/p&gt;

&lt;h2&gt;4. Loading the Hacker News data&lt;/h2&gt;

&lt;p&gt;The dataset we're using is published as a public Hugging Face Dataset:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stories&lt;/strong&gt;: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;by&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comments&lt;/strong&gt;: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;parent&lt;/code&gt;, &lt;code&gt;story_id&lt;/code&gt;, &lt;code&gt;by&lt;/code&gt;, &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: type (&lt;code&gt;story&lt;/code&gt;, &lt;code&gt;comment&lt;/code&gt;), deleted flags, etc.&lt;/li&gt;
&lt;/ul&gt;
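&lt;p&gt;The &lt;code&gt;story_id&lt;/code&gt; field on comments is what makes slicing cheap: every comment can be grouped under its story directly, no tree traversal needed. A toy example with illustrative rows shaped like this schema (not real dataset records):&lt;/p&gt;

```python
# Illustrative rows shaped like the dataset's schema (not real records).
stories = [
    {"id": 1, "title": "Show HN: fenic", "by": "alice", "time": 1700000000, "score": 120},
]
comments = [
    {"id": 10, "parent": 1, "story_id": 1, "by": "bob", "time": 1700000100, "text": "Neat!"},
    {"id": 11, "parent": 10, "story_id": 1, "by": "carol", "time": 1700000200, "text": "Agreed."},
]

# Group comments under their story via story_id, without walking parents.
by_story = {}
for c in comments:
    by_story.setdefault(c["story_id"], []).append(c)
```

&lt;p&gt;Both comments end up under story 1, even though the second is a reply to a reply.&lt;/p&gt;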

&lt;p&gt;Here's how the actual data loader works in the hn_agent project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_session&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_hn_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load all Hacker News data from HuggingFace into local tables.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;base_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hf://datasets/typedef-ai/hacker-news-dataset/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# All 2025 data files to load
&lt;/span&gt;    &lt;span class="n"&gt;files_to_tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_polls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_pollopts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pollopts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_user_submissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_submissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_item_children&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_children&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025_item_parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Load each file into its own table
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files_to_tables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_as_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; records into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session is configured with semantic support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.session.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SessionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SemanticConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAILanguageModel&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SemanticConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;language_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OpenAILanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-nano&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;rpm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;tpm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the data loading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HF_TOKEN&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.data.loader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads ~2.5M comments and ~500K stories from 2025 into a local fenic database. The loader also denormalizes data into optimized lookup tables (&lt;code&gt;comment_to_story&lt;/code&gt;, &lt;code&gt;story_threads&lt;/code&gt;, &lt;code&gt;story_discussions&lt;/code&gt;) that eliminate recursive SQL queries during tool execution.&lt;/p&gt;
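&lt;p&gt;To see why that denormalization matters: without a &lt;code&gt;comment_to_story&lt;/code&gt; table, finding a comment's story means walking the &lt;code&gt;parent&lt;/code&gt; chain recursively on every query. A hypothetical sketch (plain dicts, not the loader's actual code) of computing that mapping once, up front:&lt;/p&gt;

```python
# Hypothetical illustration: resolve each comment's story by walking the
# parent chain once up front, so later lookups are a single dict access.
items = {
    1:  {"type": "story",   "parent": None},
    10: {"type": "comment", "parent": 1},
    11: {"type": "comment", "parent": 10},  # reply to a reply
}

comment_to_story = {}
for item_id, item in items.items():
    if item["type"] != "comment":
        continue
    node = item_id
    while items[node]["type"] != "story":   # climb until we reach the story
        node = items[node]["parent"]
    comment_to_story[item_id] = node
```

&lt;p&gt;The agent's tools then join against this table instead of recursing at question time.&lt;/p&gt;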

&lt;h2&gt;5. Defining research queries&lt;/h2&gt;

&lt;p&gt;Instead of hard-coding one-off scripts, we treat &lt;strong&gt;"research questions"&lt;/strong&gt; as reusable dataframe transformations exposed as MCP tools.&lt;/p&gt;

&lt;p&gt;Here's how the search tool is registered in the actual project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fenic.api.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.core.types.datatypes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.core.mcp.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolParam&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_story_search_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Register a regex-based HN story search tool.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;

    &lt;span class="c1"&gt;# Get tables
&lt;/span&gt;    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;comment_to_story&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_to_story&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Denormalized table
&lt;/span&gt;
    &lt;span class="c1"&gt;# Tool parameters
&lt;/span&gt;    &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Story-side matches
&lt;/span&gt;    &lt;span class="n"&gt;title_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;url_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;story_text_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;story_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;story_text_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descendants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Comment-side matches - use denormalized lookup table (no recursion!)
&lt;/span&gt;    &lt;span class="n"&gt;comment_text_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;rlike&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;matched_comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;comments&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_text_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fast lookup using denormalized comment_to_story table
&lt;/span&gt;    &lt;span class="n"&gt;comment_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;matched_comments&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_to_story&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Combine and rank results
&lt;/span&gt;    &lt;span class="n"&gt;unified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;story_hits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_hits&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;sorted_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

    &lt;span class="c1"&gt;# Register the tool
&lt;/span&gt;    &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search Hacker News stories using regex patterns...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sorted_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;ToolParam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regular expression pattern to search for.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;result_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results are ranked by relevance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Title matches (most relevant)&lt;/li&gt;
&lt;li&gt;URL matches&lt;/li&gt;
&lt;li&gt;Story text matches&lt;/li&gt;
&lt;li&gt;Comment matches&lt;/li&gt;
&lt;/ol&gt;
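The tiering above is just a cascading `when`/`otherwise` over the match columns: the first condition that fires wins, and lower ranks sort first. A minimal pure-Python sketch of the same logic (the `match_rank` helper and sample rows are illustrative, not part of the project):

```python
def match_rank(title_match: bool, url_match: bool,
               text_match: bool, comment_match: bool) -> int:
    """Mirror the when/otherwise chain: title > URL > story text > comment."""
    if title_match:
        return 1
    if url_match:
        return 2
    if text_match:
        return 3
    if comment_match:
        return 4
    return 999  # no match; these rows are filtered out upstream

hits = [
    {"story_id": 1, "title_match": False, "url_match": False, "text_match": True,  "comment_match": False},
    {"story_id": 2, "title_match": True,  "url_match": False, "text_match": False, "comment_match": False},
    {"story_id": 3, "title_match": False, "url_match": True,  "text_match": False, "comment_match": False},
]

# Sort by tier, exactly as the dataframe sorts on match_rank.
ranked = sorted(hits, key=lambda h: match_rank(
    h["title_match"], h["url_match"], h["text_match"], h["comment_match"]))
print([h["story_id"] for h in ranked])  # → [2, 3, 1]
```

Within a tier, the real query then falls back to recency (`published_at` descending).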

&lt;h2&gt;
  
  
  6. Adding LLM-powered analysis
&lt;/h2&gt;

&lt;p&gt;fenic has first-class &lt;strong&gt;semantic operators&lt;/strong&gt; that wrap LLM calls as dataframe operations (with batching, retries, cost tracking, etc.). That lets you say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"For each group of comments, ask the model to summarize / classify / extract structure."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Summarize threads with structured output
&lt;/h3&gt;

&lt;p&gt;Here's how the &lt;code&gt;summarize_story&lt;/code&gt; tool works with Pydantic models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fenic.api.functions.semantic&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscussionTheme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Represents a theme or topic within a discussion.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name of the discussion theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Concise summary of the theme, viewpoints, and evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stance_spectrum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How opinions vary across this theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;representative_comment_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example comment IDs relevant to this theme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;off_topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True if this theme is off-topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StorySummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured summary of a Hacker News story and its discussion.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tl_dr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two-sentence top summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;story_overview&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Short overview of the story itself&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key points and takeaways&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;discussion_themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DiscussionTheme&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Themes across the discussion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;variety_present&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Whether discussion splits into distinct topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;off_topic_themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Names of off-topic themes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;risks_or_concerns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Risks or concerns raised&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;actionables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Any concrete action items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Referenced comment IDs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;truncated_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True if input was truncated due to size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The summarization uses fenic's &lt;code&gt;semantic.map&lt;/code&gt; with a structured prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONCISE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Summarize this Hacker News discussion in {{ language }}:

Story: {{ title }} ({{ domain }})
URL: {{ url }}
Score: {{ score }}, Comments: {{ descendants }}
Published: {{ published_at }}

Discussion thread:
{{ transcript }}

Create a structured summary including:
1. TL;DR (max 2 sentences)
2. Story overview (brief)
3. Key points from discussion
4. Main discussion themes with viewpoints and stances
5. Off-topic themes if present
6. Risks/concerns raised
7. Action items mentioned

{{ extra_instructions }}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;summary_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CONCISE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;StorySummary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_alias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_alias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;descendants&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descendants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript_limited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;extra_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extra&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;with_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;with_discussion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other columns
&lt;/span&gt;    &lt;span class="n"&gt;summary_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now each row has a &lt;strong&gt;typed&lt;/strong&gt; &lt;code&gt;summary&lt;/code&gt; object you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access nested fields directly&lt;/li&gt;
&lt;li&gt;Aggregate stance ratios by year&lt;/li&gt;
&lt;li&gt;Join back to scores, authors, etc.&lt;/li&gt;
&lt;/ul&gt;
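Because each `summary` validates against `StorySummary`, downstream code works with typed attributes instead of scraping free text. A small sketch using trimmed-down versions of the models above (the sample response values are made up):

```python
from typing import List
from pydantic import BaseModel, Field


class DiscussionTheme(BaseModel):
    topic: str
    summary: str = ""
    off_topic: bool = False


class StorySummary(BaseModel):
    tl_dr: str
    discussion_themes: List[DiscussionTheme] = Field(default_factory=list)


# Validate a structured model response into a typed object.
raw = {
    "tl_dr": "HN debates agent frameworks.",
    "discussion_themes": [
        {"topic": "Developer experience", "summary": "APIs feel heavy."},
        {"topic": "Crypto tangent", "summary": "Off-topic drift.", "off_topic": True},
    ],
}
summary = StorySummary.model_validate(raw)

# Nested fields are plain attribute access; filtering and aggregation are
# ordinary Python (or dataframe ops when kept as a struct column).
on_topic = [t.topic for t in summary.discussion_themes if not t.off_topic]
print(summary.tl_dr)  # → HN debates agent frameworks.
print(on_topic)       # → ['Developer experience']
```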

&lt;h2&gt;
  
  
  7. Putting it together as an agent loop
&lt;/h2&gt;

&lt;p&gt;fenic takes care of &lt;strong&gt;data + context&lt;/strong&gt;. To make this interactive, we wrap it in a small agent loop using &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;Pydantic AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the actual research agent from the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai.mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPServerStreamableHTTP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured output for research findings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Methods used to research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;key_findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Main discoveries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;themes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Common themes across stories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;controversies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Points of disagreement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Story IDs and titles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;limitations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research limitations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a deep research agent analyzing Hacker News discussions via MCP tools.

Available tools:
- search_stories(pattern): Find stories matching a regex pattern
- summarize_story(story_id): Get AI summary of a story and its discussion
- read_story(story_id): Get full story with comment tree (use sparingly)

Research process:
1. Use search_stories to find relevant content (max 5 searches, limit 10 per search)
2. Use summarize_story on the most relevant stories
3. Only use read_story if you need specific metadata not in summaries
4. Synthesize findings across all stories

Important:
- Keep search patterns broad initially, then refine
- Always cite story IDs in your findings
- Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t paste raw tool outputs into context
- Focus on patterns and insights across multiple stories

Return a JSON object matching the DeepResearchReport schema.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_research_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_stories_to_summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run deep research on a Hacker News topic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HN_MCP_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create MCP connection
&lt;/span&gt;    &lt;span class="n"&gt;mcp_server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPServerStreamableHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create agent with structured output
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;toolsets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mcp_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DeepResearchReport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build user prompt
&lt;/span&gt;    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Please investigate this topic across Hacker News stories and discussions.
Budget: max 5 searches, summarize up to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_stories_to_summarize&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; stories.
Focus on finding diverse perspectives and recurring themes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP server that exposes the tools is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fenic.api.mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_mcp_server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_mcp_server_sync&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_session&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hn_agent.tools.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_tools&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Start the HTTP MCP server.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Register tools first with the same session
&lt;/span&gt;    &lt;span class="nf"&gt;register_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get all tools from catalog
&lt;/span&gt;    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Create the MCP server with tools
&lt;/span&gt;    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_mcp_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the server with HTTP transport
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting MCP server on http://localhost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;run_mcp_server_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Start MCP server&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.mcp.server

&lt;span class="c"&gt;# Terminal 2: Run research queries&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="s2"&gt;"What are concerns about AI safety?"&lt;/span&gt;

&lt;span class="c"&gt;# With options&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="nt"&gt;--max-stories&lt;/span&gt; 10 &lt;span class="s2"&gt;"Latest LLM developments"&lt;/span&gt;

&lt;span class="c"&gt;# Output as JSON&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt; uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; hn_agent.cli &lt;span class="nt"&gt;--json&lt;/span&gt; &lt;span class="s2"&gt;"Rust vs Go discussions"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is the same:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dataframe slice → semantic transforms → agent consumes the results.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
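That loop can be sketched end-to-end with plain Python stand-ins, just to make the shape of the pattern concrete. Everything below (the story records, the keyword filter standing in for a dataframe slice, the truncation standing in for a semantic summary) is an illustrative assumption, not fenic or PydanticAI code:

```python
# Schematic of "dataframe slice, then semantic transform, then agent".
# All three functions are toy stand-ins for the real components.

def slice_stories(stories, keyword):
    """Stand-in for a dataframe filter: keep only matching stories."""
    return [s for s in stories if keyword in s["title"].lower()]

def summarize(story):
    """Stand-in for an LLM-backed semantic transform."""
    return {"id": story["id"], "summary": story["title"][:40]}

def agent_consume(summaries):
    """Stand-in for the agent step: synthesize over compact summaries."""
    return {"n_sources": len(summaries), "ids": [s["id"] for s in summaries]}

stories = [
    {"id": 1, "title": "Rust vs Go performance thread"},
    {"id": 2, "title": "Show HN: a new JS framework"},
    {"id": 3, "title": "Why we rewrote our service in Rust"},
]

report = agent_consume([summarize(s) for s in slice_stories(stories, "rust")])
print(report)
```

In the real project, the slice is a fenic dataframe operation, the summary is an LLM-backed transform exposed as an MCP tool, and the consumer is the PydanticAI agent.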

&lt;h2&gt;
  
  
  8. Where to go next
&lt;/h2&gt;

&lt;p&gt;Everything here is just one concrete instantiation of a more general pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Swap in your own datasets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company forum threads&lt;/li&gt;
&lt;li&gt;Support tickets&lt;/li&gt;
&lt;li&gt;Slack exports&lt;/li&gt;
&lt;li&gt;Internal RFCs / design docs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Reuse the same fenic primitives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter/slice on metadata (teams, product areas, time windows).&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;semantic.map&lt;/code&gt; with Pydantic models for structured extraction.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;semantic.extract&lt;/code&gt; for pulling typed data from text.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;fc.tool_param&lt;/code&gt; to create parameterized MCP tools.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Combine with other fenic examples&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;strong&gt;semantic join&lt;/strong&gt; examples to correlate HN threads with logs, incidents, or docs.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;clustering capabilities&lt;/strong&gt; to group similar discussions together.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;Hugging Face Datasets integration&lt;/strong&gt; to hydrate other versioned datasets into DataFrames with one line.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
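All of these primitives share one shape: declare a typed schema, then have the engine fill it from unstructured text. Here is a dependency-free sketch of that shape, using a stdlib dataclass and a trivial rule-based extractor standing in for an LLM-backed `semantic.extract` call; the schema fields and extraction rules are illustrative assumptions, not fenic API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class TicketFacts:
    # Illustrative schema; with fenic you would declare this as a Pydantic model.
    product_area: str = ""
    severity: str = ""
    mentions_ids: list = field(default_factory=list)

def extract_facts(text):
    """Rule-based stand-in for an LLM-backed extraction call."""
    severity = "high" if "outage" in text.lower() else "normal"
    area_match = re.search(r"\[(\w+)\]", text)
    return TicketFacts(
        product_area=area_match.group(1) if area_match else "unknown",
        severity=severity,
        mentions_ids=[int(n) for n in re.findall(r"#(\d+)", text)],
    )

facts = extract_facts("[billing] Outage after deploy, see #4121 and #4130")
print(facts)
```

The point of the semantic version is that the "rules" become a model prompt, so the same typed output works on text no regex could anticipate.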

&lt;p&gt;If you want to dig deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fenic core:&lt;/strong&gt; &lt;a href="https://github.com/typedef-ai/fenic" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This HN agent example:&lt;/strong&gt; &lt;a href="https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent" rel="noopener noreferrer"&gt;https://github.com/typedef-ai/fenic-examples/tree/main/hn_agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News dataset on HF:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/typedef-ai/hacker-news-dataset" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/typedef-ai/hacker-news-dataset&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love to hear how you adapt this pattern, whether it's for HN, your company's internal knowledge, or other messy discussion data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>DX UX = U DX UX =</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 05 Sep 2024 20:40:44 +0000</pubDate>
      <link>https://forem.com/cpard/dx-ux-u-dx-ux--e94</link>
      <guid>https://forem.com/cpard/dx-ux-u-dx-ux--e94</guid>
      <description>&lt;p&gt;User Experience is primarily concerned about guiding the user to a desired outcome in the most optimal way, optimizing for time and margin of error. That’s why the term journey is heavily used within the context of UX design.&lt;/p&gt;

&lt;p&gt;Developer Experience on the other hand, is not about guiding the user, but designing the right abstractions and choosing what part of the system complexity to expose to the developer.&lt;/p&gt;

&lt;p&gt;I tend to think of UX as building guardrails and DX as crafting a toolkit. While UX aims to create a smooth, intuitive path for users, DX focuses on providing developers with powerful, flexible tools that allow them to build efficiently and effectively.&lt;/p&gt;

&lt;p&gt;In UX, we’re often trying to anticipate user needs and minimize cognitive load. We create interfaces that are self-explanatory and workflows that feel natural. The goal is to make the product as easy to use as possible, even for first-time users.&lt;/p&gt;

&lt;p&gt;DX, however, is about empowering developers. It’s about creating APIs, frameworks, and development environments that are not necessarily simple on the surface, but are logically structured and well-documented. Good DX allows developers to harness complex functionality without getting bogged down in unnecessary details.&lt;/p&gt;

&lt;p&gt;Both UX and DX share a common goal of efficiency, but they approach it from different angles. UX seeks to reduce friction for end-users, while DX aims to maximize productivity for developers. In the best scenarios, great DX leads to better UX, as developers are able to create more robust, performant, and feature-rich applications.&lt;/p&gt;

&lt;p&gt;In DX, the tools we craft must feel native to the way each type of engineer understands the world. It’s crucial to recognize that different engineering disciplines often operate with distinct mental models and vocabularies, even when working with seemingly similar concepts. For instance, a data engineer and an application engineer may not share the same semantic understanding, despite potentially using identical syntax.&lt;/p&gt;

&lt;p&gt;Consider the term “partition” as an example. A data engineer might conceptualize partitions in terms of distributing large datasets across multiple storage units for efficient processing and querying. In contrast, an application engineer working with Kafka might think of partitions as a way to organize and parallelize message streams.&lt;/p&gt;

&lt;p&gt;While the word is the same, the underlying concepts and implications differ significantly based on the engineer’s domain of expertise.&lt;/p&gt;

&lt;p&gt;Therefore, when designing tools and abstractions for DX, we must tailor them to align with the specific mental models and workflow patterns of each engineering discipline. This approach ensures that our tools not only provide functionality but also resonate with the intuitive understanding and problem-solving approaches of the engineers using them.&lt;/p&gt;

&lt;p&gt;Failing to do a good job with this alignment leads to the inefficiency that plagues much of today’s compute infrastructure, the frustration practitioners feel, and a steep learning curve that ends up being the reason we aren’t seeing the growth in new practitioners entering the market that we could otherwise expect.&lt;/p&gt;

&lt;p&gt;The tools considered today as the foundation of the upcoming AI and data revolution were built in a different era and designed for a vastly different user base; they cannot address the diverse range of practitioners and use cases we have today.&lt;/p&gt;

&lt;p&gt;If we want to realize the potential of AI and data, we must fundamentally rethink the tools we have today, designing them with the needs and skills of current and future users in mind.&lt;/p&gt;

&lt;p&gt;Failing to do this will risk the success of these emerging technologies.&lt;/p&gt;

</description>
      <category>dx</category>
      <category>ux</category>
    </item>
    <item>
      <title>Why you should keep an eye on Apache DataFusion and its community.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Tue, 09 Jul 2024 05:32:43 +0000</pubDate>
      <link>https://forem.com/cpard/why-you-should-keep-an-eye-on-apache-datafusion-and-its-community-4e6g</link>
      <guid>https://forem.com/cpard/why-you-should-keep-an-eye-on-apache-datafusion-and-its-community-4e6g</guid>
      <description>&lt;p&gt;On June 24, 2025, the first San Francisco Bay Area DataFusion meetup happened. I had the opportunity to help with the organization of the event and also attend.&lt;/p&gt;

&lt;p&gt;The event had a lot of content from six different companies. These companies ranged from startups to scale-ups and big Fortune 500 companies. Leaving the event, I felt I had experienced something significant, and I want to share it with you.&lt;/p&gt;

&lt;p&gt;And trust me, you don't want to miss out on this!&lt;/p&gt;

&lt;h2&gt;
  
  
  What are you talking about, dude?
&lt;/h2&gt;

&lt;p&gt;In case you don't know what &lt;a href="https://github.com/apache/datafusion" rel="noopener noreferrer"&gt;Apache DataFusion&lt;/a&gt; is, here's the high-level blurb.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in &lt;a href="http://rustlang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a pretty good description of what DataFusion technically is, but like many amazing open source projects, it sells itself short.&lt;/p&gt;

&lt;p&gt;Here are a few reasons why I say that; by the end of this list, you'll see why you should pay close attention to the future of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, the technology
&lt;/h2&gt;

&lt;p&gt;Databases are notoriously hard to build and get to market. So hard that there's a whole graveyard of database systems that were built and never made it into a product.&lt;/p&gt;

&lt;p&gt;The reason for that is simple. Databases are just very complex systems.&lt;/p&gt;

&lt;p&gt;They stand up there together with operating systems and compilers in terms of technical complexity.&lt;/p&gt;

&lt;p&gt;Operating systems tamed this complexity with the genius of Linux: there's the kernel, and then a whole set of layers that build functionality in both user-land and kernel-land.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href="https://llvm.org/" rel="noopener noreferrer"&gt;LLVM&lt;/a&gt; revolutionized the world of programming languages and compilers. Since its creation, we've seen many new languages of increasing complexity being created.&lt;/p&gt;

&lt;p&gt;But databases are still waiting for their LLVM moment. Until today, if you wanted to build a database system, you pretty much had to build every piece of it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design the grammar of the query language&lt;/li&gt;
&lt;li&gt;Build a parser&lt;/li&gt;
&lt;li&gt;Figure out an intermediate representation&lt;/li&gt;
&lt;li&gt;Logical plans&lt;/li&gt;
&lt;li&gt;Optimizations of logical plans&lt;/li&gt;
&lt;li&gt;Query optimizers&lt;/li&gt;
&lt;li&gt;Physical plans&lt;/li&gt;
&lt;li&gt;Execution engines&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;/ul&gt;
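To make the list above concrete, here is a deliberately tiny, pure-Python caricature of those stages for a two-operation query language. It is a teaching sketch of the pipeline shape, not DataFusion code:

```python
# Toy end-to-end query pipeline mirroring the stages above:
# parse, then logical plan, then optimize, then execute.

def parse(query):
    """Parser: queries look like 'SUM a' or 'COUNT a' over a named column."""
    op, column = query.split()
    return {"op": op, "column": column}           # intermediate representation

def logical_plan(ir, table):
    """Bind the IR to actual data."""
    return {"op": ir["op"], "values": table[ir["column"]]}

def optimize(plan):
    """Optimizer: COUNT does not need the values, only how many there are."""
    if plan["op"] == "COUNT":
        return {"op": "COUNT", "values": len(plan["values"])}
    return plan

def execute(plan):
    """Execution engine: evaluate the (possibly optimized) plan."""
    if plan["op"] == "SUM":
        return sum(plan["values"])
    return plan["values"]                         # already reduced by optimizer

table = {"a": [1, 2, 3, 4]}
print(execute(optimize(logical_plan(parse("SUM a"), table))))    # 10
print(execute(optimize(logical_plan(parse("COUNT a"), table))))  # 4
```

DataFusion's value is that each of these stages is a production-grade, swappable component, so a team only has to rewrite the stage they want to innovate on.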

&lt;p&gt;And all that while fighting constantly with performance and correctness.&lt;/p&gt;

&lt;p&gt;It can be done, but it takes a lot of time, and in the world of technology, time is the only resource you don't really have.&lt;/p&gt;

&lt;p&gt;As a result, most companies that tried to market a new database didn't have enough time to figure out what the market needed.&lt;/p&gt;

&lt;p&gt;DataFusion is changing this.&lt;/p&gt;

&lt;p&gt;Its design lets a team focus on a specific part of the database system they want to change. They can then reuse the rest, which greatly reduces the time it takes to get the product to the market.&lt;/p&gt;

&lt;p&gt;A look at the companies using DataFusion today is a testament to that claim.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt;, &lt;a href="http://cube.dev/" rel="noopener noreferrer"&gt;Cube.dev&lt;/a&gt;, &lt;a href="https://www.influxdata.com/" rel="noopener noreferrer"&gt;InfluxData&lt;/a&gt;, &lt;a href="https://www.denormalized.io/" rel="noopener noreferrer"&gt;Denormalized&lt;/a&gt;, and &lt;a href="https://greptime.com/" rel="noopener noreferrer"&gt;Greptime&lt;/a&gt; are building completely different products. What they have in common, though, is that their products are a database system at their core and they are also using DataFusion to build them.&lt;/p&gt;

&lt;p&gt;Each project is innovating on a different part of the database system and reusing the rest, which DataFusion provides out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The community
&lt;/h2&gt;

&lt;p&gt;DataFusion is a young open-source project, but it has managed to build a very healthy community.&lt;/p&gt;

&lt;p&gt;That was evident at the meetup event, where everyone was there to share knowledge and seek opportunities to contribute back.&lt;/p&gt;

&lt;p&gt;Building such a community is not easy, and it's primarily the result of the hard work of a very small number of people. &lt;a href="https://x.com/andrewlamb1111" rel="noopener noreferrer"&gt;Andrew Lamb&lt;/a&gt; and &lt;a href="https://x.com/andygrove_io" rel="noopener noreferrer"&gt;Andy Grove&lt;/a&gt; have done an amazing job so far, and they deserve recognition for that.&lt;/p&gt;

&lt;p&gt;Toxicity and bad governance are what kill many open-source projects, but what I've experienced from the community so far makes me feel very optimistic about the future.&lt;/p&gt;

&lt;p&gt;Having said that, the work of these folks shouldn't be taken for granted. Everyone who benefits from the project and the community should try to support it in whatever way they can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance &amp;amp; ownership
&lt;/h2&gt;

&lt;p&gt;DataFusion is blessed to be an open-source project that doesn't have a single company maintaining it.&lt;/p&gt;

&lt;p&gt;The open-core model of monetizing software has left a very bitter taste in the mouths of many practitioners; Hashicorp and Databricks are just two examples of that.&lt;/p&gt;

&lt;p&gt;We need a different model for building monetary value on top of open source. Projects like Apache Arrow and Apache DataFusion are a great example of what a better future could look like.&lt;/p&gt;

&lt;p&gt;All the companies I mentioned in the previous section benefit from DataFusion and contribute back to the project. They also monetize their technology and build their moats, without being antagonistic to the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stars are aligned
&lt;/h2&gt;

&lt;p&gt;Finally, the market is looking for solutions to problems that will require a lot of innovation to happen in data management systems.&lt;/p&gt;

&lt;p&gt;The rise of new use cases like AI and ML is pushing existing solutions to their limits.&lt;/p&gt;

&lt;p&gt;We need to build and we don't have the luxury of iterating over 5+ years to just get a demo out there to the market.&lt;/p&gt;

&lt;p&gt;DataFusion and the rest of the Arrow ecosystem are the foundation that will enable that, and it's already happening.&lt;/p&gt;

&lt;p&gt;The companies that presented at the Bay Area meetup collectively received over $200 million in funding.&lt;/p&gt;

&lt;p&gt;All are using DataFusion for critical parts of their products and contribute back to the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;The above are just a few of the reasons that make DataFusion such a special project. It's still early, but the future looks really bright.&lt;/p&gt;

&lt;p&gt;I hope I convinced you to keep an eye on the project, and if not, reach out and let me know why. I'm happy to hear your thoughts.&lt;/p&gt;

&lt;p&gt;I'll leave you for now with a prediction: 1,000 projects built on DataFusion is not that far away!&lt;/p&gt;

</description>
      <category>datafusion</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A glimpse into the future of data processing infrastructure.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 02 May 2024 18:49:22 +0000</pubDate>
      <link>https://forem.com/cpard/a-glimpse-into-the-future-of-data-processing-infrastructure-45ne</link>
      <guid>https://forem.com/cpard/a-glimpse-into-the-future-of-data-processing-infrastructure-45ne</guid>
      <description>&lt;p&gt;Three weeks ago, VeloxCon took place in San Jose. The event was a great opportunity for people who are interested in execution engines and data processing at scale to learn about the current state of the project.&lt;/p&gt;

&lt;p&gt;Most importantly, though, it was an amazing opportunity to get a glimpse of what the future of data processing will be like. From what we saw at the event, this future is very exciting!&lt;/p&gt;

&lt;p&gt;Let's get into more details of what happened at the event and why it's important.&lt;/p&gt;

&lt;h1&gt;
  
  
  First, what is Velox?
&lt;/h1&gt;

&lt;p&gt;Velox is an open-source unified execution engine created and open-sourced by Meta, aiming to commoditize execution in data management systems.&lt;/p&gt;

&lt;p&gt;You can think of the execution engine as the component of a data management system that is responsible for processing the data.&lt;/p&gt;

&lt;p&gt;This part is usually one of the most time-consuming to build and also the most demanding in terms of correctness. We might assume that whenever we run a &lt;em&gt;SUM&lt;/em&gt; aggregation function, the result will always be correct. This is only possible because dedicated engineers invest countless hours guaranteeing that your database functions correctly.&lt;/p&gt;

&lt;p&gt;Why could Velox commoditize execution in data management systems? Because execution is such a hard problem to solve. If we had a library that always does the right thing, and does it in the best way possible, we could build the other services around it.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because if we can succeed in that, then systems and database engineers can design modular data management systems and iterate faster. As a result, users of these systems can enjoy better products and more innovation reaching them faster than ever before.&lt;/p&gt;

&lt;h1&gt;
  
  
  What happened at VeloxCon this year
&lt;/h1&gt;

&lt;p&gt;The conference ran over two days and was packed with interesting and, as expected, very technical talks. The high-level structure of the event was the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Day 1 was all about updates on the current state of the project

&lt;ul&gt;
&lt;li&gt;New features that have been delivered to Velox&lt;/li&gt;
&lt;li&gt;Updates on who is currently using it in production and how&lt;/li&gt;
&lt;li&gt;Open Source Project management and governance updates&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Day 2 was all about the future of Velox and data management systems

&lt;ul&gt;
&lt;li&gt;A shift into use cases and workloads for data management systems&lt;/li&gt;
&lt;li&gt;A lot of hardware, which might sound surprising for such a conference, however, it's one of the most interesting signals for the future of data management&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Every single presentation at VeloxCon deserves its own post. Instead, I'm going to share the takeaways that I believe are the most important. I'd suggest you then go to the &lt;a href="https://www.youtube.com/playlist?list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt; of the conference and watch all of the presentations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #1: For analytical workloads, it's all about performance optimization
&lt;/h1&gt;

&lt;p&gt;Today, we know how to manage these workloads. We also need to make them accessible to everyone. The market demands their commoditization. For analytical workloads at any scale, it's all about optimization.&lt;/p&gt;

&lt;p&gt;Optimization comes in two forms. I would argue that these two forms are just different sides of the same coin; one is performance optimization and the other is cost optimization.&lt;/p&gt;

&lt;p&gt;The good news is that there's huge opportunity for delivering value here.&lt;/p&gt;

&lt;p&gt;My take is that in the coming years we will see more performance and auto-tuning coming from the systems themselves. As a result, practitioners will spend more time building than operating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=2O6_08A-vLo&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=7" rel="noopener noreferrer"&gt;What's new in Velox - Jimmy Lu, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=qO6uUdz7GNY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=19" rel="noopener noreferrer"&gt;Velox I/O optimizations - Deepak Majeti, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #2: Defining and measuring performance is a very hard problem
&lt;/h1&gt;

&lt;p&gt;This is not just an engineering problem. It's a market problem, too.&lt;/p&gt;

&lt;p&gt;TPC-H and TPC-DS are useful for understanding the operators a data management system supports. However, they are not enough to decide whether a system is the best fit for your workloads.&lt;/p&gt;

&lt;p&gt;Just take a look at the &lt;a href="https://www.youtube.com/watch?v=2O6_08A-vLo" rel="noopener noreferrer"&gt;presentation about what's new in Velox&lt;/a&gt; to see the optimizations that are being discussed and the improvements they are talking about.&lt;/p&gt;

&lt;p&gt;The space for possible optimizations for data processing is just enormous. So how do you choose where to focus and what to go after?&lt;/p&gt;

&lt;p&gt;My take is that although TPC-* benchmarks are important tools, there's a lot more to be done on this front. Unfortunately, bench-marketing has been a plague in this industry and it's quite easy for benchmarking suites to turn into marketing tools without much substance.&lt;/p&gt;

&lt;p&gt;Because of that, a different approach to benchmarking is important. It should be more of defining frameworks and tooling to create custom benchmarks that fit the workloads of each user.&lt;/p&gt;

&lt;p&gt;Soon, the question when evaluating a data processing system will not be whether it's complete and accurate. Systems like Velox will make sure of that. Instead, we'll focus on how well a particular system performs on benchmarks that come from the user's actual workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #3: Databricks' moats are holding strong
&lt;/h1&gt;

&lt;p&gt;When I first learned about the &lt;a href="https://github.com/apache/incubator-gluten" rel="noopener noreferrer"&gt;Gluten project&lt;/a&gt; from Intel, I thought Databricks was going to be in trouble.&lt;/p&gt;

&lt;p&gt;Photon was a great advantage for them when they split from Apache Spark. If Apache Spark now gets a similar execution engine, it'll be harder for people to switch to Databricks.&lt;/p&gt;

&lt;p&gt;Especially if something like EMR Spark gets an execution engine comparable to Photon.&lt;/p&gt;

&lt;p&gt;We are still far from widespread production use of Gluten with Velox, though. Although the project is gaining momentum, and we have seen initial production deployments, there are still significant gaps that need to be addressed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initially, there were missing Spark functions in Velox. However, the gap is narrowing quickly.&lt;/li&gt;
&lt;li&gt;Second, implementing a function is one thing. Guaranteeing that the implementation is semantically and performance-wise equivalent to Spark's is another. I'll get back to this later because there are some very interesting insights from the event regarding workload migrations to new engines.&lt;/li&gt;
&lt;li&gt;Third, PySpark and dataframe support is not there yet, and that's a big issue, as these two APIs have become an important driver of Spark adoption.&lt;/li&gt;
&lt;li&gt;Finally, UDFs for Spark need to be figured out. A lot of business logic is done as User-Defined Functions on Spark. Moving these functions to a new system while making sure they still work and perform better isn't easy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are getting closer to having a Photon-like execution engine on Apache Spark; however, it will take more time before we have something that can threaten Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=7pXOAjSITYs&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=10" rel="noopener noreferrer"&gt;Accelerating Spark at Microsoft using Gluten &amp;amp; Velox - Swinky Mann and Zhen Li, Microsoft&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=pQ4bMyXXLss&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=9" rel="noopener noreferrer"&gt;Unlocking Data Query Performance at Pinterest - Zaheen Aziz&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #4: Databricks might not be in trouble, but maybe Snowflake is?
&lt;/h1&gt;

&lt;p&gt;One of the interesting things I learned at the event was that Presto's integration of Velox is much more mature than that of Spark.&lt;/p&gt;

&lt;p&gt;Although I initially expected Databricks to be the primary focus, it turned out to be different. The implementation of Velox in Spark is not progressing as quickly as it is in Presto.&lt;/p&gt;

&lt;p&gt;Currently, Prestissimo has been fast replacing good old Presto inside Meta, and it has reached a great level of maturity.&lt;/p&gt;

&lt;p&gt;Traditionally, Presto is employed for interactive analysis on a data lake. It operates similarly to a data warehouse such as Snowflake, except that it runs on more flexible infrastructure and also offers query federation capabilities.&lt;/p&gt;

&lt;p&gt;At the same time, the performance improvements that are being reported for Prestissimo are quite amazing. Being more performant means more flexibility for trading off cost over latency while enabling new workloads that were not possible before. Previously, it was uncommon to perform ETL on Presto, but now it is becoming increasingly common.&lt;/p&gt;

&lt;p&gt;If Presto performs as well as Snowflake while having a more open design, what will stop people from using it instead of Snowflake?&lt;/p&gt;

&lt;p&gt;I would argue here that the main obstacle for this to happen is the operational burden on Presto and Trino. Setting up and running these systems is often a harder task than doing the same for Spark.&lt;/p&gt;

&lt;p&gt;Migrating away from Snowflake might not be a difficult decision. Systems like Athena and Starburst Galaxy are improving their developer experience, and the performance of systems like Velox is on par with Snowflake.&lt;/p&gt;

&lt;p&gt;This makes me think that data warehousing will become a commodity much quicker than Spark and its workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=BmxqcpYeviQ&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=6" rel="noopener noreferrer"&gt;Parquet and Iceberg 2.0 Support - Ying Su, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=hx4pGdb3i04&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=5" rel="noopener noreferrer"&gt;Prestissimo at IBM - Aditi Pandit, IBM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #5: We need more developer tooling around data infrastructure
&lt;/h1&gt;

&lt;p&gt;In the past, data tools were built for data analysts, not data engineers. Data quality platforms were designed with attractive user interfaces, while catalogs were created to rival the user experience of top-notch SaaS applications.&lt;/p&gt;

&lt;p&gt;However, the future of data infrastructure is going to be a little bit different.&lt;/p&gt;

&lt;p&gt;Like in app development, the key to making data teams more productive lies not in UX, but in DX.&lt;/p&gt;

&lt;p&gt;There's a great need for tooling for developers who are responsible for building and maintaining data platforms. One great example of that is the lack of good fuzzers for testing data platforms.&lt;/p&gt;

&lt;p&gt;Jimmy Lu explained how Velox uses fuzzers to make sure Prestissimo-Velox performs as expected, compared to Presto's standard execution engine.&lt;/p&gt;
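&lt;p&gt;The idea can be illustrated with a tiny differential-testing sketch in Python (my own toy illustration, not Velox's actual fuzzer): feed the same random inputs to two implementations of &lt;em&gt;SUM&lt;/em&gt; and flag any disagreement.&lt;/p&gt;

```python
import math
import random

def naive_sum(values):
    # "Old engine": straightforward left-to-right float addition.
    total = 0.0
    for v in values:
        total += v
    return total

def reference_sum(values):
    # "New engine": math.fsum tracks partial sums for exact rounding.
    return math.fsum(values)

def fuzz_sum(rounds=1000, seed=42, tolerance=1e-9):
    """Differential fuzzing: run both SUMs on random inputs, collect mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(rounds):
        values = [rng.uniform(-1e6, 1e6) for _ in range(rng.randint(0, 100))]
        a, b = naive_sum(values), reference_sum(values)
        if not math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance):
            mismatches.append((values, a, b))
    return mismatches
```

&lt;p&gt;A real engine fuzzer also randomizes data types, null patterns, and whole query plans, but the core loop is the same.&lt;/p&gt;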

&lt;p&gt;This type of tooling is important for any team that is trying to do a migration. Not just from one vendor to another, but even between different versions of the same piece of infrastructure.&lt;/p&gt;

&lt;p&gt;Just ask anyone involved in migrating from Hive to anything else how long it took to properly migrate away from it with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks to check
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=H7L5W6Vio3U&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=8" rel="noopener noreferrer"&gt;An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=b0lNKYrkYcY&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=3" rel="noopener noreferrer"&gt;Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=pQ4bMyXXLss&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=9" rel="noopener noreferrer"&gt;Unlocking Data Query Performance at Pinterest - Zaheen Aziz&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #6: There's a tectonic shift in the importance of data workloads
&lt;/h1&gt;

&lt;p&gt;Nimble's presentation has over 7K views, while others have just over 1K.&lt;/p&gt;

&lt;p&gt;Parquet has been the foundation of large-scale data processing for over 10 years. Now, people are starting to build something to replace it. This is significant for the industry.&lt;/p&gt;

&lt;p&gt;ML workloads are becoming more important and more mainstream than they used to be. That's the shift we are talking about here.&lt;/p&gt;

&lt;p&gt;Analytical workloads will keep growing, but the market demands that ML workloads run more efficiently and reach more people.&lt;/p&gt;

&lt;p&gt;To connect with what we said earlier about analytical workloads: the market is looking for infrastructure that can deliver efficiencies for them. At the same time, it is looking for technologies that can enable ML workloads at scale, which are new types of workloads and use cases.&lt;/p&gt;

&lt;p&gt;An interesting observation, though, is that some fundamental parts of data infrastructure do not have to change to serve both workloads. Velox is a great example of this!&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=ACyhL9rdv-s&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=13" rel="noopener noreferrer"&gt;Velox Wave and Accelerators - Orri Erling, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=CXxDNWrdEyk&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=16" rel="noopener noreferrer"&gt;Theseus: A composable, distributed, hardware-agnostic processing engine - Voltron Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=bISBNVtXZ6M&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=17" rel="noopener noreferrer"&gt;Nimble, A New Columnar File Format - Yoav Helfman, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take #7: Hardware is still the catalyst for innovation in data processing
&lt;/h1&gt;

&lt;p&gt;Hardware has always been a catalyst for data processing technologies. (If you don't believe me, just listen to &lt;a href="https://datastackshow.com/podcast/system-evolution-from-hadoop-to-rocksdb-with-dhruba-borthakur-of-rockset/" rel="noopener noreferrer"&gt;Dhruba Borthakur&lt;/a&gt;, the creator of RocksDB).&lt;/p&gt;

&lt;p&gt;Over time, database systems have changed a lot. First, there was Hadoop, which used cheap hard disk drives (HDDs). Then, low-latency systems like RocksDB came along because of cheap solid-state drives (SSDs). Now, we have cloud warehouses, which are possible because of the massive and cheap block storage on the cloud.&lt;/p&gt;

&lt;p&gt;The main difference between the above and what is happening today, though, is that today the main driver of innovation is not storage but processing. GPUs, TPUs, FPGAs, any sort of on-chip accelerator is what is going to drive the next wave of innovation in data management systems.&lt;/p&gt;

&lt;p&gt;VeloxCon's second day focused on hardware accelerators. Talks covered what hardware vendors are bringing and what query engines need to support this new hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Talks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=ACyhL9rdv-s&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=13" rel="noopener noreferrer"&gt;Velox Wave and Accelerators - Orri Erling, Meta&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=NwGjRdFqghI&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=15" rel="noopener noreferrer"&gt;Velox, Offloading work to accelerators - Sergei Lewis, Rivos&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=ghhiBE23kqg&amp;amp;list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&amp;amp;index=14" rel="noopener noreferrer"&gt;NeuroBlade's SPU Accelerates Velox by 10x - Krishna Maheshwari, Neuroblade&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Outro
&lt;/h1&gt;

&lt;p&gt;VeloxCon gave me a glimpse of a future that is fast approaching.&lt;/p&gt;

&lt;p&gt;There are many challenges, but with them also many amazing opportunities for building new technologies and delivering immense amounts of value.&lt;/p&gt;

&lt;p&gt;I personally cannot wait to see what will happen in the next couple of months with Velox and the industry. Exciting times ahead!&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>snowflake</category>
      <category>spark</category>
    </item>
    <item>
      <title>WTF is a Vector Database?</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Sun, 23 Apr 2023 04:57:53 +0000</pubDate>
      <link>https://forem.com/cpard/wtf-is-a-vector-database-3l01</link>
      <guid>https://forem.com/cpard/wtf-is-a-vector-database-3l01</guid>
      <description>&lt;p&gt;It’s obviously a database, right? 😄 but how is it different from whatever you’ve heard until now that is a database? Like MySQL or PostgreSQL? &lt;/p&gt;

&lt;p&gt;Let’s start by going through the basics and trust me, by the end of this you will have a much better understanding of WTF is a Vector Database!&lt;/p&gt;

&lt;h1&gt;
  
  
  WTF is a Vector Database?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  It’s a database
&lt;/h2&gt;

&lt;p&gt;I’ll commit some plagiarism here, but it’s better to hear what a database is from someone who knows it much better than I do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_7.02.46_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_7.02.46_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;Vector databases do organize inter-related data that models some aspect of the real world! They are not a core component of most computer applications yet, but if the AI revolution lives up to its current hype, they might become one.&lt;/p&gt;

&lt;p&gt;Databases usually come packaged as Database Management Systems (DBMS). You have probably heard this term already, and it’s important to keep in mind the difference between a database and a DBMS.&lt;/p&gt;

&lt;p&gt;A set of CSV files in your file system can definitely be a database. It absolutely follows the above definition. It can contain information that is inter-related and that models some aspect of the real-world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database to DBMS
&lt;/h2&gt;

&lt;p&gt;But what turns a database into a DBMS is what makes databases hard in general. A DBMS includes functionality for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensuring Data Integrity&lt;/li&gt;
&lt;li&gt;Data manipulation and access, i.e. add new data&lt;/li&gt;
&lt;li&gt;Durability, i.e. what if the database crashes?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If in the above functionality we also add APIs for generic software to interact with the database for storing and processing data, then we have the definition of what a DBMS is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Models &amp;amp; Databases
&lt;/h2&gt;

&lt;p&gt;Let’s see what CMU-DB and Prof. Pavlo have to say about data models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.52.00_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.52.00_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;And most importantly let’s see some examples of Data models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.53.40_PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.devtools.wtf%2FScreenshot_2023-03-31_at_8.53.40_PM.png" alt="“[01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf" rel="noopener noreferrer"&gt;01 Course Intro &amp;amp; Relational Model - Intro to database systems (15-445/645)&lt;/a&gt;” Andy Pavlo, Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;The course is about Relational databases but you might have noticed that there’s a mention to vectors in there! &lt;/p&gt;

&lt;p&gt;This is important here because it gives us the first concrete definition of what a vector database is. &lt;/p&gt;

&lt;p&gt;💡 A vector database is a DBMS that supports a Vector Data Model, in other words it’s a DBMS that uses vectors for describing the data in a database.&lt;/p&gt;

&lt;p&gt;As we will see, it’s pretty easy to add support for vectors to most of the relational databases that exist today. What makes vector databases a different breed is their native support for specific operations around vectors that are important for Machine Learning and AI. &lt;/p&gt;
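&lt;p&gt;As a rough sketch of how vector support can be bolted onto an existing relational database (my own toy illustration using SQLite; real extensions such as pgvector are far more efficient), we can store vectors as JSON arrays and register cosine similarity as a user-defined function:&lt;/p&gt;

```python
import json
import math
import sqlite3

def cosine_similarity(a_json, b_json):
    """Cosine similarity between two vectors stored as JSON array text."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it as cosine_sim().
conn.create_function("cosine_sim", 2, cosine_similarity)
conn.execute("CREATE TABLE docs (name TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("cat", json.dumps([1.0, 0.9, 0.0])),
     ("dog", json.dumps([0.9, 1.0, 0.1])),
     ("car", json.dumps([0.0, 0.1, 1.0]))],
)
# Rank rows by similarity to a query vector; note this is a full table scan.
query = json.dumps([1.0, 1.0, 0.0])
rows = conn.execute(
    "SELECT name, cosine_sim(embedding, ?) AS sim FROM docs ORDER BY sim DESC",
    (query,),
).fetchall()
```

&lt;p&gt;What a dedicated vector database adds on top is, among other things, the indexing that avoids the full scan in the query above.&lt;/p&gt;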

&lt;h2&gt;
  
  
  What is a vector?
&lt;/h2&gt;

&lt;p&gt;Let’s take a trip down memory lane. Hopefully the following definition brings back some good memories from your youth.&lt;/p&gt;

&lt;p&gt;💡 a vector is a mathematical object that represents a quantity that has both magnitude and direction.&lt;/p&gt;

&lt;p&gt;This is the definition of a vector that most people have encountered at some point in their life.&lt;/p&gt;

&lt;p&gt;If we go a little bit deeper into Wikipedia, we will also find the following general definition of vectors.&lt;/p&gt;

&lt;p&gt;💡 In mathematics and physics, a vector is a term that refers colloquially to some quantities that cannot be expressed by a single number (a scalar), or to elements of some vector spaces.&lt;/p&gt;

&lt;p&gt;To the above definition let’s add the one that refers to what a vector is in computer science.&lt;/p&gt;

&lt;p&gt;💡 In computer science, an array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.&lt;/p&gt;

&lt;p&gt;The above definition refers to an “array”, but array and vector are often used interchangeably. &lt;/p&gt;

&lt;p&gt;We will talk more about vector spaces and features and all the cool stuff of AI a bit later but for now the above definitions are what matter.&lt;/p&gt;

&lt;p&gt;First, you have to forget what you learned about vectors at school; we are not talking about Euclidean vectors here. Magnitude and direction are not important. &lt;/p&gt;

&lt;p&gt;What is important is the way we plan to represent the world in our database.&lt;/p&gt;

&lt;p&gt;💡 We use quantities that cannot be expressed by a single number; instead, we care about elements of some kind of vector space, and the way we represent these values in a computer is as a collection of values, each identified by a key or index.&lt;/p&gt;

&lt;p&gt;The above gives us the &lt;strong&gt;how&lt;/strong&gt; we want to &lt;strong&gt;represent&lt;/strong&gt; the world and the &lt;strong&gt;how&lt;/strong&gt; to &lt;strong&gt;store&lt;/strong&gt; this information in a way that a &lt;strong&gt;machine&lt;/strong&gt; can process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need vectors?
&lt;/h2&gt;

&lt;p&gt;tldr - vectors allow machines to understand how things like text, photos, and video are related to one another.&lt;/p&gt;

&lt;p&gt;So far we’ve been a bit too technical, offering definitions that hopefully make things clearer, but we haven’t talked at all about why we even care about vectors. What is wrong with what traditional relational databases already offer?&lt;/p&gt;

&lt;p&gt;It all started with our need to represent rich text documents not just syntactically but also semantically. &lt;/p&gt;

&lt;p&gt;The idea is that we can represent a document as a vector of identifiers. These vectors define a vector space, which happens to also be an algebraic model.&lt;/p&gt;

&lt;p&gt;Because of that, we hope that we can use the mathematical tools of algebraic vector spaces to do interesting things like figuring out how similar two documents are!&lt;/p&gt;
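&lt;p&gt;Here’s a minimal sketch of that idea (a toy bag-of-words example of my own, not Lucene’s actual model): each document becomes a vector of term counts, and cosine similarity tells us how close two documents are.&lt;/p&gt;

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a document as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "stock prices fell today",
]
# Build one shared vocabulary so all document vectors have the same dimensions.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in docs]
# The two cat documents end up closer to each other than to the finance one.
```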

&lt;p&gt;This idea is not new. Do you know this guy?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg75um5tlzwezbzur1qgj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg75um5tlzwezbzur1qgj.jpg" alt="By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0," width="360" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0,&lt;/p&gt;

&lt;p&gt;In case you don’t, this is Doug Cutting and he’s the author of &lt;a href="https://lucene.apache.org" rel="noopener noreferrer"&gt;Apache Lucene&lt;/a&gt; that was open sourced in 1999. Lucene is probably the first and most well known library for indexing and searching text. Lucene implements the “vector space model” we talked about.&lt;/p&gt;

&lt;p&gt;Vectors and vector spaces are a powerful way to represent information in a way that we can perform search and comparisons beyond what the standard scalar operators allow us to do. &lt;/p&gt;

&lt;p&gt;But hopefully you are already wondering: if Lucene and the vector space model have existed since the ’90s, why do we care about vector databases today? Also, is Lucene a vector database?&lt;/p&gt;

&lt;p&gt;To understand why, we need to talk about a few more things first. But before we do that, let’s summarize.&lt;/p&gt;

&lt;p&gt;💡 Vectors are useful because we can turn rich information into vectors in an algebraic model in which we can apply standard algebraic operations like comparisons and measurements. These operations can then be used for information retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embeddings
&lt;/h2&gt;

&lt;p&gt;Since 1999 and Lucene, it took us about another 13 years to take the next step in information retrieval. &lt;/p&gt;

&lt;p&gt;Welcome to 2013 and the work of Tomas Mikolov at Google, called Word2Vec. &lt;/p&gt;

&lt;p&gt;Word2Vec is a technique that uses neural networks to learn word associations from a large corpus of text. These neural networks generate what the NLP literature usually calls &lt;em&gt;word embeddings&lt;/em&gt;, which are representations of words. &lt;/p&gt;

&lt;p&gt;The representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.&lt;/p&gt;

&lt;p&gt;I hope you see how often the term “vector” is being used.&lt;/p&gt;

&lt;p&gt;The beauty of these algorithms is that after we have created these embeddings or vectors, we can use mathematical functions like the cosine similarity, to measure the &lt;em&gt;semantic&lt;/em&gt; similarity of words. &lt;/p&gt;
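&lt;p&gt;As a toy sketch (the vectors below are made-up values of mine, not real Word2Vec output), finding the word closest in meaning reduces to a nearest-neighbor search under cosine similarity:&lt;/p&gt;

```python
import math

# Toy 4-dimensional "embeddings"; real Word2Vec vectors have 100-300 dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.7, 0.2, 0.1],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word):
    """Return the semantically closest word under cosine similarity."""
    return max(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(embeddings[word], embeddings[other]),
    )
```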

&lt;p&gt;It’s also important to note that these embeddings, or representations, are real-valued vectors. &lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Transformers
&lt;/h3&gt;

&lt;p&gt;Today, Word2Vec is no longer the state of the art in generating embeddings. Instead, we use &lt;em&gt;Transformers&lt;/em&gt;, which are deep learning models. Models like GPT are based on Transformers.&lt;/p&gt;

&lt;p&gt;Regardless of the model used though, the output remains the same. Our information is represented as a real-valued vector and we can still use math to retrieve semantic information from our data. &lt;/p&gt;

&lt;p&gt;💡 Embeddings are representations of words that turn them into real-valued vectors, which can then be used in conjunction with standard algebraic tools to extract semantic information, e.g. to compare two words semantically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Put Everything Together
&lt;/h2&gt;

&lt;p&gt;In 2023 we have some amazing technologies that can take rich information as input, e.g. a novel, and turn it into a new representation that we can query using machines. &lt;/p&gt;

&lt;p&gt;To do that, these technologies turn the information into real-valued vectors.&lt;/p&gt;

&lt;p&gt;To work with this information we now need efficient systems to store and process these real-valued vectors and do that at scale. &lt;/p&gt;

&lt;p&gt;That’s exactly what a vector database is. &lt;/p&gt;

&lt;p&gt;💡 A Vector Database is a DBMS that can efficiently store real-valued vectors of arbitrary dimensions and perform operations on them, like applying the cosine-similarity function. On top of that, a Vector Database also has to offer all the functionality commonly found in a DBMS, like durability, integrity, and user-driven manipulation of the data.&lt;/p&gt;
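&lt;p&gt;As a minimal sketch of that core operation, here is a brute-force nearest-neighbor search by cosine similarity in Python. Everything a real vector database adds (approximate indexes, durability, scale) is deliberately left out, and all names and vectors are invented for illustration:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two real-valued vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """Brute-force nearest-neighbor search; real vector databases use
    approximate indexes (e.g. HNSW) to make this fast at scale."""

    def __init__(self):
        self.items = {}  # id -> vector

    def add(self, item_id, vector):
        self.items[item_id] = vector

    def search(self, query, k=1):
        # Rank all stored vectors by similarity to the query, highest first.
        ranked = sorted(self.items, key=lambda i: cosine(query, self.items[i]), reverse=True)
        return ranked[:k]

store = ToyVectorStore()
store.add("doc_cats", [0.9, 0.1, 0.0])
store.add("doc_dogs", [0.8, 0.2, 0.1])
store.add("doc_stocks", [0.0, 0.1, 0.9])
print(store.search([0.85, 0.15, 0.05], k=2))  # the two pet documents rank first
```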

&lt;p&gt;Let’s now see what the unique characteristics of a Vector Database are and how one is built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check The Next Article in the Series for how to build one!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>openai</category>
      <category>chatgpt</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MLOps is 98% Data Engineering</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 03 Apr 2023 18:19:05 +0000</pubDate>
      <link>https://forem.com/cpard/mlops-is-98-data-engineering-2bpi</link>
      <guid>https://forem.com/cpard/mlops-is-98-data-engineering-2bpi</guid>
      <description>&lt;h2&gt;
  
  
  MLOps is Mostly Data Engineering
&lt;/h2&gt;

&lt;p&gt;💡 TL;DR MLOps emerged as a new category of tools for managing data infrastructure, specifically for ML use cases with the main assumption being that ML has unique needs. &lt;/p&gt;

&lt;p&gt;After a few years and with the hype gone, it has become apparent that MLOps overlap more with Data Engineering than most people believed. Let’s see why and what that means for the MLOps ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;MLOps is a relatively recent term. A quick search on &lt;a href="https://trends.google.com/trends/explore?date=2019-01-01%202023-03-11&amp;amp;q=%2Fg%2F11h1vbjpbg&amp;amp;hl=en" rel="noopener noreferrer"&gt;Google Trends&lt;/a&gt; reveals that the term started being searched for, around the end of 2019.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tm20wdz591o08xtvgru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tm20wdz591o08xtvgru.png" alt="im1" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon examining the trend line above, we can observe a significant spike that occurred at the end of 2021. Since then, the interest has remained high.&lt;/p&gt;

&lt;p&gt;ML is not something new though. If we check &lt;a href="https://trends.google.com/trends/explore?date=all&amp;amp;q=Machine%20Learning&amp;amp;hl=en" rel="noopener noreferrer"&gt;Google Trends for that term&lt;/a&gt;, we will see that it has existed since 2004, with interest growing exponentially since 2015.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixu7g35phqsopp17hyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixu7g35phqsopp17hyi.png" alt="im2" width="800" height="302"&gt;&lt;/a&gt;&lt;br&gt;
Interest Over Time for the term Machine Learning on Google&lt;/p&gt;

&lt;p&gt;Machine learning has made amazing progress in the past 10 years, with some of the most important achievements in tech being related to it.&lt;/p&gt;

&lt;p&gt;The rapid growth of machine learning is what sparked the creation of MLOps as a category. With the pace of innovation around ML accelerating, teams and companies have started to have issues keeping up. &lt;/p&gt;

&lt;p&gt;Building and operating ML products started putting a lot of pressure on the data and ML engineering teams, and where there’s pain there’s also opportunity!&lt;/p&gt;

&lt;p&gt;More and more people started seeing opportunities to bring new products to the market, promising to turn every company out there with any data into an AI-driven organization. &lt;/p&gt;

&lt;p&gt;And just like that, we reached the state of the industry you can see below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxlnewixxckwlf4pi8pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxlnewixxckwlf4pi8pf.png" alt="im3" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MLOps category as included in &lt;a href="https://mattturck.com/landscape/mad2023.pdf" rel="noopener noreferrer"&gt;MAD 2023&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep in mind that the above landscape includes only companies labeled as “MLOps” and there are overlaps with other categories in the ML section of &lt;strong&gt;&lt;em&gt;MAD 2023&lt;/em&gt;&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;43 vendors and around $1B in investments, without accounting for public companies like Google and AWS investing in the space. &lt;/p&gt;

&lt;p&gt;What are all these companies offering? Let’s see!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is inside an MLOps platform?
&lt;/h2&gt;

&lt;p&gt;The MLOps vendors can be split among a number of product categories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment &amp;amp; Serving of models, e.g. OctoML&lt;/li&gt;
&lt;li&gt;Model Quality and Monitoring, e.g. Weights &amp;amp; Biases&lt;/li&gt;
&lt;li&gt;Model training, e.g. AWS Sagemaker&lt;/li&gt;
&lt;li&gt;Feature Stores, e.g. Tecton&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s important to mention here that the above categories are complementary in many cases; for example, if you use a Feature Store, you also need a service for model training. &lt;/p&gt;

&lt;p&gt;If you pay attention to the product categories above, you will notice that there is nothing particularly unique about them in the grand scheme of things.&lt;/p&gt;

&lt;p&gt;What do I mean by that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment and serving of models&lt;/strong&gt; → This is a common operation found in both data engineering and software engineering. People have been deploying pipelines or even better, deploying applications of various complexity way before ML was a thing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model quality and Monitoring&lt;/strong&gt; → This is a problem unique to ML. The way you monitor a model for quality is not the same as the way you monitor a software project or a data pipeline. But this is only part of the quality problem, as we will see later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model training&lt;/strong&gt; → This is unique to ML, but building models is nothing new. The question is: what has changed in the past five years that requires a completely different paradigm for doing it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature stores&lt;/strong&gt; → This is one of the most interesting products of MLOps. For the uninitiated, the first thing that comes to mind is some kind of specialized database, but feature stores are actually more than that. They are a complete data infrastructure architecture that vendors have proposed and attempted to productize. How different is it from classic data infrastructure architectures? We will see.  &lt;/p&gt;

&lt;p&gt;Let’s see how each one of the above categories overlap (or not) with Data Engineering and what that means.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment &amp;amp; Serving of Models
&lt;/h2&gt;

&lt;p&gt;This is one of the most interesting aspects of MLOps in my opinion. Mainly because this is the part where the outcome of the work an ML Engineer does gets to the point where concrete value can be generated out of it.&lt;/p&gt;

&lt;p&gt;A recommender can serve recommendations to users and fraud detection can be applied in real time.&lt;/p&gt;

&lt;p&gt;But what is interesting here is that this process doesn’t have much to do with ML; the engineering problems are more related to product engineering. &lt;/p&gt;

&lt;p&gt;We can think of a model as a function that requires some input and generates some output. To deliver value with this function we need a way to add it as part of the product experience we are delivering.&lt;/p&gt;

&lt;p&gt;In engineering terms that means that we have to wrap the model as a service with a clean API that will be exposed to the product engineers.&lt;/p&gt;
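&lt;p&gt;As a sketch of what “wrapping the model as a service” means, here is a minimal Python example. The model, its features, and the JSON contract are all invented for illustration; a real deployment would use a proper web framework and model runtime:&lt;/p&gt;

```python
import json

def predict(features):
    # Stand-in for a trained model: just a function from input to output.
    score = 0.3 * features["age"] + 0.7 * features["income"]
    return {"churn_risk": "high" if score > 50 else "low"}

def handle_request(body):
    """The thin service layer product engineers integrate against:
    a clean JSON-in / JSON-out contract around the model."""
    features = json.loads(body)
    return json.dumps(predict(features))

print(handle_request('{"age": 40, "income": 80}'))  # {"churn_risk": "high"}
```

&lt;p&gt;From the outside, this is indistinguishable from any other service: it takes a request, returns a response, and everything around it (deployment, scaling, monitoring the endpoint) is standard platform engineering.&lt;/p&gt;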

&lt;p&gt;Then we need to deploy this service in a scalable and predictable way, just like we do with any other service for our product.&lt;/p&gt;

&lt;p&gt;After that we need to operate the service and ensure that it is provisioned with the resources it needs based on demand.&lt;/p&gt;

&lt;p&gt;We also need to monitor the service for problems and be able to fix them as soon as possible.&lt;/p&gt;

&lt;p&gt;Finally, we want some kind of continuous integration and deployment process to ship updates to the service, just like we do with any other service in our product.&lt;/p&gt;

&lt;p&gt;As we can see, the above process is almost identical to managing the release cycle of any other software component out there, with product engineering as the primary stakeholder. &lt;/p&gt;

&lt;p&gt;After all, they have to ensure that the new functionality the model provides is integrated into the product in the right way, without disrupting its operations. &lt;/p&gt;

&lt;p&gt;There’s one specific need that is imposed on the engineering and ops teams by working with ML models, and it is related to monitoring the performance of the model itself, but we will talk more about this later.&lt;/p&gt;

&lt;p&gt;The question here is: if integrating a model into our product doesn’t differ from any other feature we release, in terms of release and platform engineering and operations, why do we need a whole new category of products?&lt;/p&gt;

&lt;p&gt;My opinion here is that the industry is trying to solve the unique challenges of turning models into services by building complete new platforms, but this is less than optimal. &lt;/p&gt;

&lt;p&gt;The true need here is developer tooling that will enrich the existing and proven platforms and methodologies for releasing and operating software at scale for the case of doing that with ML models as the foundational software artifact. &lt;/p&gt;

&lt;p&gt;We don’t need MLOps engineers, we need tools that will allow ML Engineers to package their work in a way that the platform and release engineers will be able to consume and produce the artifacts needed for the product engineers to integrate into the product. &lt;/p&gt;

&lt;p&gt;A recurrent pattern I see is an attempt from vendors who are trying to become category creators to define a new type of engineer. &lt;/p&gt;

&lt;p&gt;In most cases, this is a crossover between existing roles, e.g. the analytics engineer, someone who is primarily an analyst but also does some of the data engineering work, such as creating pipelines. &lt;/p&gt;

&lt;p&gt;This is probably a smart marketing move but the world doesn’t work like that. New roles emerge and cannot be forced by a vendor.&lt;/p&gt;

&lt;p&gt;Why would we want ML Engineers to assume the responsibilities of a release or platform engineer? Why would we want to introduce them to a completely new category of tools that feels alien to their practice?&lt;/p&gt;

&lt;p&gt;Separation of concerns is a good thing both in software architecture and in organizational design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Quality and Monitoring
&lt;/h2&gt;

&lt;p&gt;This is where things get really interesting. Quality assurance, control, and monitoring are huge topics in software engineering. In a way, and with a bit of exaggeration, these are the elements that turn software engineering into… engineering. &lt;/p&gt;

&lt;p&gt;There are many best practices and mature platforms for software quality related tasks. The problem is that ML models can easily challenge these.&lt;/p&gt;

&lt;p&gt;You might have heard that quality in data infrastructure is hard and it is. It’s not just the software that we have to monitor for quality, it’s also the data. And data is a different beast when it comes to applying quality concepts.&lt;/p&gt;

&lt;p&gt;In ML the situation is even worse. You pretty much have a generated black-box system, and you need to monitor its performance just by observing the outputs it produces for the inputs it gets in production. &lt;/p&gt;

&lt;p&gt;Because of this, model quality and monitoring is usually mentioned together with terms like model drift, where the model’s “predictive” performance is monitored over time; if it drops below a threshold, we know that we need to retrain the model with fresh data.&lt;/p&gt;

&lt;p&gt;Which makes sense, right? As our product changes and our customers’ behavior changes, the model needs to be retrained to account for these changes.&lt;/p&gt;
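&lt;p&gt;A minimal sketch of that retraining trigger in Python; the accuracy numbers, window size, and threshold are all invented for illustration:&lt;/p&gt;

```python
def needs_retraining(accuracy_history, threshold=0.85, window=3):
    """Flag the model for retraining when the rolling mean of its
    observed accuracy in production drops below the threshold."""
    if len(accuracy_history) < window:
        return False
    recent = accuracy_history[-window:]
    return sum(recent) / window < threshold

# A model slowly drifting as customer behavior changes:
history = [0.93, 0.92, 0.90, 0.86, 0.82, 0.79]
print(needs_retraining(history))  # True: rolling mean ≈ 0.823 < 0.85
```

&lt;p&gt;Structurally this is the same threshold-and-alert pattern used for any product metric, which is exactly the overlap argued below.&lt;/p&gt;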

&lt;p&gt;I have two main arguments here.&lt;/p&gt;

&lt;p&gt;The first is: how different is the observability of model quality metrics like drift from any product-related monitoring? In product, we keep monitoring the performance of our features: do people engage with them in the way we expect? If something changed and engagement dropped, we should address that, right? &lt;/p&gt;

&lt;p&gt;These are all part of what is usually referred to as experimentation infrastructure for product, and a big part of it requires the right data infrastructure and data engineering to exist. &lt;/p&gt;

&lt;p&gt;No matter how unique ML models are, in the end we are observing how a service or feature performs while interacting with our users and, based on the data we collect, figuring out if action is needed. &lt;/p&gt;

&lt;p&gt;My feeling is that there’s a lot of overlap here between ML observability and the data infrastructure and engineering foundations that the organization is building for product experimentation.&lt;/p&gt;

&lt;p&gt;My other argument is about data quality in general. ML models are built on top of data; their quality is a direct reflection of the quality of the data used to build them.&lt;/p&gt;

&lt;p&gt;This is a serious problem that data engineering is constantly fighting, and I can’t see how replicating this process helps in any way to solve it. &lt;/p&gt;

&lt;p&gt;Data engineers are the people who are monitoring the data from its capture to the point where the ML engineer can use it. They have access to the whole supply chain of data and they can monitor and add controls at any point of that chain. &lt;/p&gt;

&lt;p&gt;Adding another platform that is overlapping with both the data engineering and product engineering quality controls is not going to solve the problem and in the worst case it might make it even worse.&lt;/p&gt;

&lt;p&gt;Again, the solution here is engineering tooling that enriches the existing architectures and solutions: figuring out what quality for data entails and equipping the people whose job is to ensure data and product quality to extend their reach into the ML models too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Training
&lt;/h2&gt;

&lt;p&gt;This is a short one to be honest. Model Training has more to do with Cloud Computing than anything else and in my opinion this is the space where the big cloud providers are mainly delivering value today. The main reason being the need for hardware to exist to do the actual training. &lt;/p&gt;

&lt;p&gt;But in the general case, model training is nothing more than a data pipeline. Data is read from a number of sources and gets transformed through the application of a training algorithm. It doesn’t matter that much whether this happens on the CPU or the GPU. &lt;/p&gt;

&lt;p&gt;This is the bread and butter of Data Engineering, the tooling exists already and the main differentiation that I see here is the cloud compute abstraction where we are talking about a completely different category of infrastructure anyway. &lt;/p&gt;

&lt;p&gt;Model training at scale should be part of the data engineering discipline as they have the tooling already, they have the responsibility for the SLAs on the data needed and they can control that release lifecycle much better.&lt;/p&gt;

&lt;p&gt;Should the ML people bother with these operations? I can’t see why, to be honest. I believe they would prefer to spend more time building new models than dealing with operations for data crunching at scale. &lt;/p&gt;

&lt;p&gt;I’m starting to repeat myself at this point, but again, we don’t need new platforms. We just need to give DEs the right tooling to communicate effectively with both ML and production engineers and add model training as another step in their ETL pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Stores
&lt;/h2&gt;

&lt;p&gt;I left Feature Stores for the end on purpose as they are a great example of the overlap with data engineering while their popularity is a great indication that something is not right with the current state of data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tecton.ai%2Fwp-content%2Fuploads%2F2020%2F10%2Fwhatisfeaturestore2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.tecton.ai%2Fwp-content%2Fuploads%2F2020%2F10%2Fwhatisfeaturestore2.svg" alt="feature" width="1000" height="500"&gt;&lt;/a&gt;&lt;br&gt;
The above is a feature store architecture as presented by Tecton, one of the first and most popular feature store vendors.&lt;/p&gt;

&lt;p&gt;Looking at that we see that we have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stream data sources&lt;/li&gt;
&lt;li&gt;Batch data sources&lt;/li&gt;
&lt;li&gt;Transformations&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;li&gt;Serving&lt;/li&gt;
&lt;li&gt;Model serving and training&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature stores are similar to a typical data infrastructure architecture used by companies that require both streaming and batch processing capabilities. However, they specialize in supporting machine learning features by serving only one type of data consumer - the ML model.&lt;/p&gt;

&lt;p&gt;Vendors have packaged the feature store architecture into products, which has caused some confusion. Some may question the need for another Spark or Flink cluster for real-time feature generation, especially if they are already using those tools for ETL jobs. However, feature stores are useful because they describe what needs to be added to existing data infrastructure to effectively productize machine learning. &lt;/p&gt;

&lt;p&gt;As a product, feature stores should focus on building tooling and practices for data, ML, and product engineers to work together more effectively. Any additional overhead and complexity should be carefully evaluated to ensure that the benefits of using a feature store outweigh the costs.&lt;/p&gt;

&lt;p&gt;Vendors should focus on providing useful tooling to support this, rather than duplicating existing data infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I hope that by reading this essay you didn’t feel like I’m trying to dismiss MLOps because I’m not. &lt;/p&gt;

&lt;p&gt;I believe that ML and its productization is important and will become even more important in the future and for this to happen the right tooling is needed. &lt;/p&gt;

&lt;p&gt;But it’s time for the MLOps industry to mature and understand who the right audience is, what the problems are and bring the next iteration of solutions in the market. &lt;/p&gt;

&lt;p&gt;Money and time were spent, and lessons should have been learned. I can’t wait to see what the next iteration of these products will be.&lt;/p&gt;

&lt;p&gt;There’s a lot of opportunity ahead!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>database</category>
      <category>career</category>
    </item>
    <item>
      <title>What is a Data Contract?</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Thu, 30 Mar 2023 19:30:26 +0000</pubDate>
      <link>https://forem.com/cpard/what-is-a-data-contract-5dnj</link>
      <guid>https://forem.com/cpard/what-is-a-data-contract-5dnj</guid>
      <description>&lt;p&gt;You might have started hearing the term Data Contract recently and wonder what it is. &lt;/p&gt;

&lt;p&gt;TL;DR: Data contracts are just integration testing, CI/CD, and APIs for data.&lt;/p&gt;

&lt;p&gt;But, the difficult part of data contracts comes from the organizational complexity of implementing them.&lt;/p&gt;

&lt;p&gt;Technically, data contracts require investing in data lineage and data profiling. Nothing much new here.&lt;/p&gt;

&lt;p&gt;Data quality is a team sport though, you really need everyone in the org to buy into the concept of data contracts to implement them effectively. &lt;/p&gt;

&lt;p&gt;Data quality is also hard, so you need to clearly articulate the value it brings for people to keep caring about it!&lt;/p&gt;

&lt;p&gt;So, for data contracts to succeed, together with the tech we also need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create the right environment for people to collaborate cross-functionally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communicate effectively across the whole organization, starting from leadership, about the value of data quality or the cost of bad quality in data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on incremental data quality improvements instead of trying to build a complete end to end solution. Data quality is a continuous process anyway and you want to start delivering value as soon as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learn from other disciplines. Alert fatigue is a thing and SREs and DevOps have known about it for a long time. No need to reinvent and learn by doing the same mistakes other engineers have already done.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, remember that Data Contracts already exist in any organization. They are just implicit. The whole point of talking about Data Contracts is to encourage people to make them explicit and to understand the value you get by doing that.&lt;/p&gt;
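&lt;p&gt;To make “explicit” concrete: in its simplest form, an explicit data contract is just a schema the producer promises and the consumer verifies. A minimal sketch in Python, with field names and types invented for illustration:&lt;/p&gt;

```python
# An explicit contract: the schema a producer promises to emit.
CONTRACT = {"user_id": int, "event": str, "ts": float}

def violations(record, contract=CONTRACT):
    """Return a list of contract violations; empty means the record conforms."""
    found = []
    for field, expected in contract.items():
        if field not in record:
            found.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            found.append(f"{field}: expected {expected.__name__}")
    return found

print(violations({"user_id": 1, "event": "click", "ts": 1680000000.0}))  # []
print(violations({"user_id": "1", "event": "click"}))  # wrong type + missing ts
```

&lt;p&gt;Run as a check in CI/CD or against production samples, this is the “integration testing for data” the TL;DR describes.&lt;/p&gt;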

&lt;p&gt;For more on this topic, Chad Sanderson, who coined the term Data Contract, did an amazing job explaining everything on &lt;a href="https://datastackshow.com/podcast/data-quality-and-data-contracts-with-chad-sanderson-of-data-quality-camp/" rel="noopener noreferrer"&gt;this Data Stack Show episode&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>sql</category>
      <category>python</category>
    </item>
    <item>
      <title>A Tutorial on SQL Window Functions.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Sat, 18 Mar 2023 21:27:26 +0000</pubDate>
      <link>https://forem.com/cpard/a-tutorial-on-sql-window-functions-1ck1</link>
      <guid>https://forem.com/cpard/a-tutorial-on-sql-window-functions-1ck1</guid>
      <description>&lt;h2&gt;
  
  
  Working with SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;💡 TL;DR: Window functions are among the most powerful and useful features of any SQL query engine. However, the declarative nature of SQL can make them feel counterintuitive when you first start working with them. In this guide, I will demonstrate the beauty of SQL windows and show that they are actually much less intimidating than you might think (and even fun!).&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Window Functions
&lt;/h2&gt;

&lt;p&gt;DuckDB provides 14 SQL window-related functions in addition to all the aggregation functions that can be combined with windows. Snowflake, on the other hand, offers more than 70 functions that can be used with SQL windows. PostgreSQL also supports 11 SQL window-related functions, as well as all the aggregation functions that are packaged by default, in addition to any user-provided aggregation function.&lt;/p&gt;

&lt;p&gt;Hopefully, the above information has captured your attention and helped you realize how important SQL windows are, based on the effort database vendors are making to add support for them.&lt;/p&gt;

&lt;p&gt;But what’s a window in SQL?&lt;/p&gt;

&lt;p&gt;The concept of windows is actually pretty simple. It allows us to perform calculations across sets of rows that are related to the current row in some way. &lt;/p&gt;

&lt;p&gt;Think of iterating through all the rows, where the calculation we want to perform involves not just the current row’s values but also a subset of the total rows.&lt;/p&gt;

&lt;p&gt;Another way to think about window functions is by considering the &lt;em&gt;GROUP BY&lt;/em&gt; semantics. When we use &lt;em&gt;GROUP BY&lt;/em&gt; we are asking SQL to compute a function by grouping first the data using the parameters of the &lt;em&gt;GROUP BY&lt;/em&gt; clause. Consider the following SQL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_actions&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;user_activity&lt;/span&gt; &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, we ask SQL to split events among unique user_ids and count them for each user separately. Both the calculated values and the grouping keys will be included in the resulting table. For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Assuming the above input table, the result of the query will look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;total_actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can think of SQL windows as processing data in a similar way, but without necessarily collapsing the results into one row per group at the end.&lt;/p&gt;
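&lt;p&gt;To see the difference concretely, here is a sketch using Python’s built-in sqlite3 module (window functions require SQLite 3.25 or newer); the table mirrors the example above:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (user_id INTEGER, event TEXT)")
con.executemany("INSERT INTO user_activity VALUES (?, ?)",
                [(1, "click"), (2, "click"), (2, "load"), (1, "load"), (1, "load")])

# GROUP BY collapses the rows: one output row per user.
grouped = con.execute(
    "SELECT user_id, count(event) AS total_actions "
    "FROM user_activity GROUP BY user_id ORDER BY user_id").fetchall()
print(grouped)  # [(1, 3), (2, 2)]

# The window version keeps every input row, annotated with its group's count.
windowed = con.execute(
    "SELECT user_id, event, count(event) OVER (PARTITION BY user_id) "
    "FROM user_activity ORDER BY user_id").fetchall()
print(windowed)  # five rows, each carrying its user's total
```

&lt;p&gt;Same grouping, same count, but the window query preserves every original row instead of collapsing them.&lt;/p&gt;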

&lt;p&gt;But windows are actually even more powerful as we will see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes after partitioning ?
&lt;/h2&gt;

&lt;p&gt;As we’ve seen previously, partitioning is the first important concept to understand about windows. We can create subsets of our data and perform calculations only inside these subsets, and the main mechanism for defining the sets is describing to SQL how to partition the data.&lt;/p&gt;

&lt;p&gt;But we can do more than that! See the example below,&lt;/p&gt;

&lt;p&gt;Here's an example SQL query that uses a window function and the &lt;code&gt;LAG()&lt;/code&gt; function to calculate the time lapse between consecutive events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;time_lapse&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;user_activity&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query partitions the data by &lt;code&gt;user_id&lt;/code&gt; and orders it by &lt;code&gt;event_time&lt;/code&gt;. Remember what we said previously about partitioning? You see it here in action. We want our calculation to be performed for each user, so we partition on &lt;code&gt;user_id&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;We also sort our data by &lt;code&gt;event_time&lt;/code&gt;, because we want to calculate the time it took our user to perform one event after the other; the sorting matters for what we’ll do next.&lt;/p&gt;

&lt;p&gt;Our query then uses the &lt;code&gt;LAG()&lt;/code&gt; function, which is the window function that will do the magic for us. What &lt;code&gt;LAG()&lt;/code&gt; does is make the previous row’s value available to the row we are currently processing, within our window!&lt;/p&gt;

&lt;p&gt;The code part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does exactly that: while we process the current row’s &lt;code&gt;event_time&lt;/code&gt;, we get access to the &lt;code&gt;event_time&lt;/code&gt; value of the previous row, and because the rows are sorted we can subtract the two values and calculate the time lapse!&lt;/p&gt;

&lt;p&gt;The window magic is defined in this part of the syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OVER&lt;/code&gt; keyword indicates the start of the window definition. Just try to read it as a sentence written in English and it almost explains what is happening. That’s part of the beauty of a declarative language!&lt;/p&gt;

&lt;p&gt;The difference is calculated over sets of rows that are created using the user_id, so we get one set of rows for each user, and each set is sorted by the event_time. What is important to remember here is that the sorting is not global: data is sorted only within each individual set defined by the user_id. &lt;/p&gt;

&lt;p&gt;After the partitions have been created and sorted, the query engine starts iterating over the rows of each partition, and at each row the &lt;em&gt;LAG&lt;/em&gt; function makes the value of the previous row available to the engine. &lt;/p&gt;

&lt;p&gt;At this point, everything is available for the engine to calculate the difference between the two values and that’s exactly what it does!&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LAG()&lt;/code&gt; and the above example is a great introduction to the last important concept about windows in SQL. &lt;em&gt;Framing&lt;/em&gt;!&lt;/p&gt;
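&lt;p&gt;If you want to run this pattern end to end right now, here’s a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite has supported window functions since 3.25). The table name and values are made up for illustration; storing &lt;code&gt;event_time&lt;/code&gt; as epoch seconds keeps the subtraction simple:&lt;/p&gt;

```python
import sqlite3

# Toy user_activity table; event_time is stored as epoch seconds so the
# subtraction directly yields the gap in seconds.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (user_id TEXT, event_time INTEGER)")
con.executemany(
    "INSERT INTO user_activity VALUES (?, ?)",
    [("a", 0), ("a", 60), ("a", 200), ("b", 10), ("b", 25)],
)

rows = con.execute("""
    SELECT user_id, event_time,
           event_time - LAG(event_time) OVER (
               PARTITION BY user_id ORDER BY event_time
           ) AS time_lapse
    FROM user_activity
    ORDER BY user_id, event_time
""").fetchall()
print(rows)
# [('a', 0, None), ('a', 60, 60), ('a', 200, 140), ('b', 10, None), ('b', 25, 15)]
```

&lt;p&gt;Note how the first row of each partition gets &lt;code&gt;None&lt;/code&gt;: there is no previous row for &lt;em&gt;LAG&lt;/em&gt; to look at.&lt;/p&gt;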

&lt;h2&gt;
  
  
  Windows and Frames
&lt;/h2&gt;

&lt;p&gt;Windowing breaks up a relation into independent groups or "partitions," then orders those partitions and computes a new column for each row based on nearby values.&lt;/p&gt;

&lt;p&gt;In many cases, the functions we apply depend only on the partition boundary and perhaps the ordering, as in the very simple first example we went through.&lt;/p&gt;

&lt;p&gt;In other cases though, the function might need access to some of the previous or following values. This was the case with &lt;em&gt;LAG&lt;/em&gt; in our previous example. &lt;/p&gt;

&lt;p&gt;Although we had defined a partition based on the user_id, we also needed to provide our function (in this case subtraction) with the value preceding the current one. This is exactly what &lt;em&gt;LAG&lt;/em&gt; did.&lt;/p&gt;

&lt;p&gt;Frames are a generalization of this concept. &lt;/p&gt;

&lt;p&gt;💡 A frame is a number of rows on either side (preceding or following) of the current row.&lt;/p&gt;

&lt;p&gt;In our previous example, the frame was the one row preceding the current row. We didn’t specify how many preceding rows to consider, so &lt;em&gt;LAG&lt;/em&gt; used the default, which is 1, but we can use any number we want.&lt;/p&gt;

&lt;p&gt;In DuckDB the definition of &lt;em&gt;LAG&lt;/em&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt; &lt;span class="k"&gt;any&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;any&lt;/span&gt; &lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Offset&lt;/em&gt; refers to the number of rows preceding the current one that we want to access. We can also set a &lt;em&gt;default&lt;/em&gt; value to return when the requested offset does not exist. For example, if we are at the first row and want to access the previous one, we can define a default value to return instead.&lt;/p&gt;
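&lt;p&gt;Here’s a quick sketch of the three-argument form, again using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; (which takes the same &lt;code&gt;lag(expr, offset, default)&lt;/code&gt; arguments); the table and values are invented:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (30,), (40,)])

# LAG(v, 2, -1): look two rows back; return the default -1 when no such row exists.
rows = con.execute(
    "SELECT v, LAG(v, 2, -1) OVER (ORDER BY v) FROM t ORDER BY v"
).fetchall()
print(rows)  # [(10, -1), (20, -1), (30, 10), (40, 20)]
```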

&lt;p&gt;Let’s consider the following table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x9j6oet9cib6ai56w2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x9j6oet9cib6ai56w2e.png" alt="events" width="800" height="632"&gt;&lt;/a&gt;&lt;br&gt;
To better understand partitions versus frames, let’s see how the table looks if we apply a window like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the table below you can see how it looks after the application of &lt;em&gt;PARTITION BY&lt;/em&gt;. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xrhviobch3a1sjyjke7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xrhviobch3a1sjyjke7.png" alt="events" width="800" height="625"&gt;&lt;/a&gt;&lt;br&gt;
And this is how the table looks after we order it. If you look closely, you will see that the ordering exists only inside each partition; it is not global.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnhvarfg1qjpf8ta9z8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnhvarfg1qjpf8ta9z8l.png" alt="events" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can think of the frame as a “window” sliding over the partitioned and ordered data, with a size equal to the offset parameter; in this case the offset is 1.&lt;/p&gt;

&lt;p&gt;Consider that we are currently at the row with event_id = 16249. Based on what we’ve said so far, the frame will include this row and the previous one. What do you think the result of the &lt;em&gt;LAG&lt;/em&gt; function will be for this frame?&lt;/p&gt;

&lt;p&gt;The answer is 0. Remember that the frame has a default value equal to 0, which is returned when there’s no preceding row? Remember also that the frame has meaning only inside the boundaries of the partition?&lt;/p&gt;

&lt;p&gt;The frame in this case is at the first position of the current partition and as a result the default value will be returned.&lt;/p&gt;
&lt;h2&gt;
  
  
  What about Nulls? Do they matter?
&lt;/h2&gt;

&lt;p&gt;Null values always matter! 😀&lt;/p&gt;

&lt;p&gt;We should always be aware of the null semantics of our window functions. Always check the documentation to see how the window function we care about behaves in the presence of nulls. &lt;/p&gt;

&lt;p&gt;For example, in DuckDB some window functions can be configured to ignore nulls, although the default behavior for all of them is to respect nulls. One such function is &lt;strong&gt;&lt;em&gt;LAG&lt;/em&gt;&lt;/strong&gt;, which we used earlier.&lt;/p&gt;

&lt;p&gt;In any case, make sure you understand the semantics of your functions and of the data you are working with. What happens if an aggregate function divides by a null value? &lt;/p&gt;
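&lt;p&gt;To make null propagation concrete, here’s a small sketch (again with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; and invented data) where one of the values is null:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (pos INTEGER, v INTEGER)")
con.executemany("INSERT INTO m VALUES (?, ?)", [(1, 5), (2, None), (3, 9), (4, 12)])

# LAG here respects NULLs: wherever the current or the previous value is NULL,
# the subtraction propagates NULL instead of skipping the offending row.
rows = con.execute(
    "SELECT pos, v - LAG(v) OVER (ORDER BY pos) FROM m ORDER BY pos"
).fetchall()
print(rows)  # [(1, None), (2, None), (3, None), (4, 3)]
```

&lt;p&gt;A single null value poisoned three of the four results, which is exactly why it pays to check the null semantics up front.&lt;/p&gt;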
&lt;h2&gt;
  
  
  Enough with theory, let’s have fun!
&lt;/h2&gt;

&lt;p&gt;Ok, let’s work on some examples using window functions. For these examples we will be using DuckDB. &lt;/p&gt;

&lt;p&gt;First, download DuckDB if you haven’t already. I’d recommend just &lt;a href="https://duckdb.org/#quickinstall" rel="noopener noreferrer"&gt;downloading the CLI&lt;/a&gt;, but feel free to use any other way of working with DuckDB.&lt;/p&gt;

&lt;p&gt;We will be working with JSON files so you will also need to &lt;a href="https://duckdb.org/docs/extensions/json" rel="noopener noreferrer"&gt;install the JSON extension&lt;/a&gt; for DuckDB. To do that, you just have to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#./duckdb&lt;/span&gt;

Enter &lt;span class="s2"&gt;".help"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;usage hints.
Connected to a transient &lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="nt"&gt;-memory&lt;/span&gt; database.
Use &lt;span class="s2"&gt;".open FILENAME"&lt;/span&gt; to reopen on a persistent database.
D &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all, now you are ready to start playing around and being dangerous.&lt;/p&gt;

&lt;h3&gt;
  
  
  The case of Sessionization
&lt;/h3&gt;

&lt;p&gt;We will work through a very common problem that requires window functions. We want to be pragmatic here, so we are aiming for a real-life example that you will most probably face sooner or later.&lt;/p&gt;

&lt;p&gt;What is sessionization?&lt;/p&gt;

&lt;p&gt;Assuming we have a number of user interactions captured at different moments in time, how can we group them into “meaningful” buckets? By meaningful, in our context, we mean events that happened during one online session of the user. &lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;Typically a session is defined as&lt;/em&gt;: all events that happened in less than 30 minutes from each other for a specific user.&lt;/p&gt;

&lt;p&gt;The definition can get more complicated, but this one will suffice for our needs and, to be honest, it’s one of the most commonly used. For example, Google Analytics uses it as the default session definition. &lt;/p&gt;
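&lt;p&gt;To see where this definition leads, here is the classic two-step window pattern for it, sketched with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; and made-up data: flag the rows that start a new session with &lt;em&gt;LAG&lt;/em&gt;, then number the sessions with a running &lt;em&gt;SUM&lt;/em&gt; over the flags:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")  # ts: epoch seconds
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 4000), ("u1", 4100), ("u2", 50)],
)

# Step 1: compute the gap to the previous event per user (LAG).
# Step 2: flag a new session when there is no previous event or the gap
#         exceeds 30 minutes (1800s), then number sessions with a running SUM.
rows = con.execute("""
    WITH gaps AS (
        SELECT user_id, ts,
               ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS gap
        FROM events
    )
    SELECT user_id, ts,
           SUM(CASE WHEN gap IS NULL OR gap > 1800 THEN 1 ELSE 0 END)
               OVER (PARTITION BY user_id ORDER BY ts) AS session_id
    FROM gaps
    ORDER BY user_id, ts
""").fetchall()
print(rows)
# [('u1', 0, 1), ('u1', 600, 1), ('u1', 4000, 2), ('u1', 4100, 2), ('u2', 50, 1)]
```

&lt;p&gt;The same shape of query works in DuckDB over the real event data we’ll load next.&lt;/p&gt;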

&lt;h3&gt;
  
  
  The data
&lt;/h3&gt;

&lt;p&gt;Now that we have the tools and the problem, we just need data and we are ready to go.&lt;/p&gt;

&lt;p&gt;Again, we will try to be as realistic as possible. We will be using customer event data captured in the format supported by RudderStack and Segment. &lt;/p&gt;

&lt;p&gt;These are the two most commonly used tools for capturing user interactions.&lt;/p&gt;

&lt;p&gt;For more information on the whole schema of this format, you should check the amazing &lt;a href="https://www.rudderstack.com/docs/event-spec/standard-events/" rel="noopener noreferrer"&gt;documentation provided here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will be using data that has been artificially generated, in case you’d like to generate your own data, you can use the tool I used. &lt;a href="https://github.com/cpard/events" rel="noopener noreferrer"&gt;You can find it here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I’ll also include a sample file that you can use directly! Using the event generator is useful if you want to experiment with different numbers of users and events and work on the performance of your queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  The queries
&lt;/h3&gt;

&lt;p&gt;The first thing we have to do is load our data into DuckDB and see what it looks like.&lt;/p&gt;

&lt;p&gt;Here I’ll assume that the file is named &lt;em&gt;test.json&lt;/em&gt; and that it’s in the same path that you run duckdb from. Feel free to play around with paths etc.; it helps to get a better grasp of how the CLI works and of DuckDB’s SQL syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json_objects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'test.json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if we execute the above, we’ll see something like this as output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmvhj5whfu1ns9c64yp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmvhj5whfu1ns9c64yp0.png" alt="raw-json" width="800" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although this worked, it’s not exactly useful yet. We ended up with a table that has a single column of type &lt;em&gt;json&lt;/em&gt; containing our JSON objects, with one line of the input file corresponding to one row of the output table.&lt;/p&gt;

&lt;p&gt;To make this more useful, we need to use some of the additional DuckDB magic for working with JSON. &lt;/p&gt;

&lt;p&gt;Consider the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.context.traits.id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.message_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.event_type'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'original_timestamp'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; 
    &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result you get should look something like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuebco108lr9q1whoxhkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuebco108lr9q1whoxhkc.png" alt="extract-json" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we did was to use the &lt;em&gt;json_extract&lt;/em&gt; function of DuckDB to extract only the fields we care about. In this case we want:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;user_id, so we can create sessions for each user_id&lt;/li&gt;
&lt;li&gt;event_type, this is not strictly necessary but it is helpful to have some metadata about our events&lt;/li&gt;
&lt;li&gt;original_timestamp, this is obviously needed as we need to perform calculations based on time to create the sessions &lt;/li&gt;
&lt;li&gt;event_id, we want a way to link back to the initial record.&lt;/li&gt;
&lt;/ol&gt;
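&lt;p&gt;If the &lt;em&gt;json_extract&lt;/em&gt; paths look opaque, here’s the same extraction in plain Python with the stdlib &lt;code&gt;json&lt;/code&gt; module; the event line below is made up, with field names following the RudderStack/Segment spec referenced earlier:&lt;/p&gt;

```python
import json

# A made-up event line in the shape the queries above expect.
line = (
    '{"context": {"traits": {"id": "u-1"}}, "message_id": "m-1", '
    '"event_type": "track", "original_timestamp": "2023-01-01 10:00:00 UTC"}'
)

obj = json.loads(line)
# The Python equivalent of json_extract(json, '$.context.traits.id'), etc.
user_id = obj["context"]["traits"]["id"]
event_type = obj["event_type"]
print(user_id, event_type)  # u-1 track
```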

&lt;p&gt;If you paid attention to the above results you will notice a few issues&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All the data types are of type &lt;em&gt;json&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;event_id&lt;/em&gt; is &lt;em&gt;null&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first issue is expected, and it’s part of our job to take care of it as we build our code. The second issue, though, is weird: shouldn’t the event have a unique id? Is this a coincidence?&lt;/p&gt;

&lt;p&gt;Let’s see what we can figure out, but first let’s make our lives a bit easier by executing the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; 
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.context.traits.id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.message_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.event_type'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'original_timestamp'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
        &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; 
        &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We create a view so we don’t have to run the long query above every time we want to query it. Now let’s see what we can learn about the &lt;em&gt;message_id&lt;/em&gt; column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;json&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query gives us back all the distinct values of the column, and for &lt;em&gt;event_id&lt;/em&gt; everything is null, which is not good! &lt;/p&gt;

&lt;p&gt;Obviously this shouldn’t have happened; the events should have a unique ID. But reality is far from ideal and issues like this can always happen, so how do we move forward with the dirty data set we have?&lt;/p&gt;

&lt;p&gt;If we need the event_id, we have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check our pipelines to see why the event_id wasn’t captured. Maybe the pipeline ignored the field when the data was extracted, or maybe it was never captured in the first place.&lt;/li&gt;
&lt;li&gt;Come up with a solution to add a unique id for our current data set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although we don’t need the event_id for our sessionization example, we’ll go through an example of how this could be done. Keep in mind that there are many different ways to do it actually!&lt;/p&gt;

&lt;p&gt;The options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A common way to create unique IDs is to use hashing. This will also allow you to deduplicate your data if you need to. The way to do it is to apply a hashing function, e.g. md5, to the whole row. If two rows are completely identical, the generated hashes will be equal.&lt;/li&gt;
&lt;li&gt;An even easier way is to just use the position of the row in the table as the id. This guarantees uniqueness, but it won’t help you deduplicate the data.&lt;/li&gt;
&lt;li&gt;Use something like the uuid DuckDB function that returns random UUIDs, and hope that random also means unique.&lt;/li&gt;
&lt;/ol&gt;
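&lt;p&gt;A quick sketch of option 1 in Python, using the stdlib &lt;code&gt;hashlib&lt;/code&gt;; the row tuple below reuses the sample user_id from earlier with made-up values for the other fields:&lt;/p&gt;

```python
import hashlib

# Option 1, sketched: serialize the row and hash it. Identical rows produce
# identical hashes, which is also what enables deduplication.
row = ("15474ff6-3e59-44fa-a875-13c1b2f9d101", "track", "2023-01-01 10:00:00 UTC")
row_id = hashlib.md5("|".join(row).encode("utf-8")).hexdigest()
print(row_id)  # a 32-character hex digest, stable for identical rows
```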

&lt;p&gt;In our case we will opt for the second option as it’s the easier one, but please feel free to try the first!&lt;/p&gt;

&lt;p&gt;Performing this task is also a gentle introduction to window functions, as it is where we will use our first one.&lt;/p&gt;

&lt;p&gt;See the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_idx&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faisflvg48jfl2esxsp2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faisflvg48jfl2esxsp2i.png" alt="json-table" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We excluded event_id as part of cleaning our dataset and used row_number() to get the row number and use it as the id for the event. See the use of the &lt;em&gt;over&lt;/em&gt; keyword? That’s an indication that row_number() is a window function. &lt;/p&gt;

&lt;p&gt;In our case we didn’t want to have partitions, because that wouldn’t generate globally unique ids, so we ran the window function over the whole table.&lt;/p&gt;
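&lt;p&gt;You can verify this behavior with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; as well (toy table and values below):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ev (user_id TEXT)")
con.executemany("INSERT INTO ev VALUES (?)", [("a",), ("b",), ("a",)])

# An empty OVER () runs the window over the whole table, so the generated
# numbers are globally unique rather than restarting per partition.
rows = con.execute(
    "SELECT user_id, ROW_NUMBER() OVER () AS event_idx FROM ev"
).fetchall()
print(sorted(idx for _, idx in rows))  # [1, 2, 3]
```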

&lt;p&gt;Now that we have a way to generate unique ids let’s figure out how to get rid of the &lt;em&gt;json&lt;/em&gt; type and turn it into something more useful. &lt;/p&gt;

&lt;p&gt;Consider the following SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                &lt;span class="n"&gt;user_id&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                &lt;span class="nb"&gt;varchar&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nv"&gt;"15474ff6-3e59-44fa-a875-13c1b2f9d101"&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function &lt;em&gt;CAST&lt;/em&gt; is what we need here. We ask DuckDB to take the JSON type and turn it into a string, and in this case it worked perfectly, as you can see from the result.&lt;/p&gt;

&lt;p&gt;Now consider the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;Conversion&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;""&lt;/span&gt;&lt;span class="mi"&gt;1970&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="nv"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;YYYY&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DD&lt;/span&gt; &lt;span class="n"&gt;HH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;SS&lt;/span&gt;&lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="n"&gt;US&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="err"&gt;±&lt;/span&gt;&lt;span class="n"&gt;HH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;MM&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ouch, we got an error! Apparently the format we used for the date cannot be converted into a timestamp. We need to fix this before we move on. &lt;/p&gt;

&lt;p&gt;💡 These types of issues are extremely common when working with SQL and data in general. That’s why I thought it would be good to have to figure this out as part of the exercise!&lt;/p&gt;

&lt;p&gt;Check the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;parsed_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="n"&gt;parsed_timestamp&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="nb"&gt;timestamp&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;1970&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0001&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did it! So what happens in the above query?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, we trim the double-quote character from both the beginning and the end of the value.&lt;/li&gt;
&lt;li&gt;Then we account for two cases: one where the time includes milliseconds and one where it doesn’t. Again, this is an issue with how the data was generated, and we have to fix it here.&lt;/li&gt;
&lt;li&gt;For each case, we use &lt;em&gt;strptime&lt;/em&gt; to turn the string into a timestamp.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;CASE&lt;/em&gt; is the equivalent of &lt;em&gt;IF-THEN-ELSE&lt;/em&gt; statements in SQL. &lt;/p&gt;
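&lt;p&gt;If it helps to see the same fallback logic outside SQL, here is a minimal Python sketch using the standard library’s &lt;code&gt;strptime&lt;/code&gt; (the &lt;code&gt;parse_ts&lt;/code&gt; helper name is mine, not part of the dataset or DuckDB):&lt;/p&gt;

```python
from datetime import datetime

def parse_ts(raw: str) -> datetime:
    # Mirror of the SQL CASE: strip the surrounding double quotes,
    # then try the format with fractional seconds before falling
    # back to the one without them.
    s = raw.strip('"')
    for fmt in ("%Y-%m-%d %H:%M:%S.%f UTC", "%Y-%m-%d %H:%M:%S UTC"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp: " + raw)

print(parse_ts('"2023-02-28 21:07:13.123 UTC"'))  # with milliseconds
print(parse_ts('"2023-02-28 21:07:13 UTC"'))      # without
```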

&lt;p&gt;Now that we have figured everything out, let’s transform our raw data into something easier to work with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
   &lt;span class="k"&gt;CASE&lt;/span&gt; 
 &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b4f1qfq6t9a79ignf4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b4f1qfq6t9a79ignf4m.png" alt="casting" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and we have what we need! Now let’s create a view so we can work with it easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; 
   &lt;span class="k"&gt;CASE&lt;/span&gt; 
 &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%.%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S.%f UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;original_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'%Y-%m-%d %H:%M:%S UTC'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
   &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;extracted_json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="err"&gt;┌──────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;count_star&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;int64&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├──────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="mi"&gt;31415&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to check the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;describe&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv822bxwpf1zhnfg2s8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv822bxwpf1zhnfg2s8j.png" alt="describe-table" width="800" height="257"&gt;&lt;/a&gt;&lt;br&gt;
We are good to go!&lt;/p&gt;

&lt;p&gt;I know it’s been a journey so far, but we have already worked with a window function, and we also did something important: we cleaned and prepared our data!&lt;/p&gt;

&lt;p&gt;This is a big part of the work involved with data.&lt;/p&gt;

&lt;p&gt;Now let’s go back to sessions. Remember the definition of a session? &lt;/p&gt;


&lt;p&gt;💡 A session is the set of events for a particular user where less than 30 minutes pass between successive events.&lt;/p&gt;



&lt;p&gt;If you remember the examples we gave earlier, you might have already figured out that &lt;em&gt;LAG&lt;/em&gt; is a great candidate for helping us with our problem here. Let’s see how.&lt;/p&gt;

&lt;p&gt;Consider the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;events_enriched&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; 
&lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;prev_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; 
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'30 minutes'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;prev_timestamp&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 
      &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events_enriched&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_timestamp&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are also using CTEs, which we haven’t talked about yet, but don’t worry too much if you find this &lt;em&gt;WITH&lt;/em&gt; syntax new. It’s mainly a way to organize the code and make it cleaner.&lt;/p&gt;

&lt;p&gt;As you can see we start by enriching our events by adding the previous timestamp as a new column. To do that we of course use &lt;em&gt;LAG&lt;/em&gt; and windows! The way that part of the query works should be clear to you by now.&lt;/p&gt;

&lt;p&gt;The second part is the session creation. Here we are creating a new column that tracks the session id. The interesting part is what’s inside the &lt;strong&gt;&lt;em&gt;SUM&lt;/em&gt;&lt;/strong&gt; clause. &lt;/p&gt;

&lt;p&gt;Again here you see the beauty of the declarative nature of SQL. We can add a 0 or 1 based on the difference between the two columns that represent the event times. &lt;/p&gt;

&lt;p&gt;Once again, we use &lt;em&gt;SUM&lt;/em&gt; as a window function (remember that any aggregate function can be used as a window function) to calculate the session id for each user.&lt;/p&gt;
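&lt;p&gt;To see the LAG-plus-running-SUM idea in one place, here is a small Python sketch of the same sessionization logic; the &lt;code&gt;sessionize&lt;/code&gt; function and &lt;code&gt;GAP&lt;/code&gt; constant are illustrative names of mine, and events are assumed to arrive as (user_id, timestamp) pairs:&lt;/p&gt;

```python
from datetime import datetime, timedelta
from itertools import groupby

# Same 30-minute rule as the SQL query.
GAP = timedelta(minutes=30)

def sessionize(events):
    # events: iterable of (user_id, timestamp) pairs.
    # Returns (user_id, timestamp, session_id) rows. Within each
    # user's ordered events, a missing previous event or a gap
    # larger than 30 minutes bumps the session counter -- exactly
    # what the CASE inside the windowed SUM does.
    out = []
    for user, group in groupby(sorted(events), key=lambda e: e[0]):
        prev, session = None, 0
        for _, ts in group:
            if prev is None or ts - prev > GAP:
                session += 1
            out.append((user, ts, session))
            prev = ts
    return out

t0 = datetime(2023, 2, 28, 21, 0, 0)
demo = [("u1", t0), ("u1", t0 + timedelta(minutes=10)),
        ("u1", t0 + timedelta(minutes=50))]
print(sessionize(demo))  # the 40-minute gap starts session 2
```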

&lt;p&gt;With that query we end up with a result like the following; I will limit the results to 10 rows for convenience.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ml70ewbfx3l719fxxnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ml70ewbfx3l719fxxnd.png" alt="final-data" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Isn’t it pretty? 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Window functions are super powerful. If you master the concepts behind them you’ll be able to write some very expressive and elegant SQL code. &lt;/p&gt;

&lt;p&gt;They might require you to change the way you think, especially if you are coming from more imperative programming languages, but it won’t take long to get comfortable with them.&lt;/p&gt;

&lt;p&gt;I hope that the examples I gave were helpful! &lt;/p&gt;

&lt;p&gt;In any case, please let me know what you think and what else you’d like to see as a SQL tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/data_types/interval" rel="noopener noreferrer"&gt;DuckDB Intervals Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/data_types/timestamp" rel="noopener noreferrer"&gt;DuckDB Timestamp Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/char" rel="noopener noreferrer"&gt;DuckDB Text Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/dateformat" rel="noopener noreferrer"&gt;DuckDB Date Formats Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/extensions/json" rel="noopener noreferrer"&gt;DuckDB JSON Extension Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/window_functions" rel="noopener noreferrer"&gt;DuckDB Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/installation/index" rel="noopener noreferrer"&gt;DuckDB Installation Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rudderstack.com/docs/event-spec/standard-events/common-fields/" rel="noopener noreferrer"&gt;RudderStack Event Schema Spec&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/current/functions-window.html" rel="noopener noreferrer"&gt;PostgreSQL Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/sql-reference/functions-analytic#overview" rel="noopener noreferrer"&gt;Snowflake Window Functions Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cpard/events" rel="noopener noreferrer"&gt;Fake Events Generator Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cpard/sql_tutorials/tree/main/window_functions" rel="noopener noreferrer"&gt;Sample Event Data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Practical Guide to SQL Dates with DuckDB.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 06 Mar 2023 00:08:09 +0000</pubDate>
      <link>https://forem.com/cpard/a-guide-to-sql-dates-with-duckdb-2kj3</link>
      <guid>https://forem.com/cpard/a-guide-to-sql-dates-with-duckdb-2kj3</guid>
      <description>&lt;h2&gt;
  
  
  A pragmatic guide on working with dates in SQL using DuckDB
&lt;/h2&gt;

&lt;p&gt;Ok, so we talked about how SQL deals with dates, and there’s a rich arsenal of tools to help us do whatever we want with time information in SQL.&lt;/p&gt;

&lt;p&gt;But, working with dates is still a pain. &lt;/p&gt;

&lt;p&gt;You might think that the technology is just not there, but I would argue that it’s not a technical issue. It’s more of a human issue, plus the fact that time is just a hard concept.&lt;/p&gt;

&lt;p&gt;Interpreting time is a messy business. &lt;/p&gt;

&lt;p&gt;There are different ways to do it, and there’s no way someone can figure out which one was used just by observing the syntax.&lt;/p&gt;

&lt;p&gt;For example, consider the following date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"2023-01-02"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's the right date format?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"yyyy-mm-dd" or "yyyy-dd-mm"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both of the above formats can syntactically express this date, but only one is right. To know which one, we either need more samples or we need someone to explicitly tell us what the format is.&lt;/p&gt;

&lt;p&gt;This is just a tiny example that people moving from Europe to the US might have encountered in their everyday lives, even without having to write code to parse dates.&lt;/p&gt;
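&lt;p&gt;To make the ambiguity concrete, here is a tiny Python illustration (its &lt;code&gt;strptime&lt;/code&gt; works much like the SQL function of the same name): the very same string parses successfully under both formats and yields two different dates.&lt;/p&gt;

```python
from datetime import datetime

s = "2023-01-02"
# Read as year-month-day: January 2nd.
print(datetime.strptime(s, "%Y-%m-%d").date())  # 2023-01-02
# Read as year-day-month: February 1st.
print(datetime.strptime(s, "%Y-%d-%m").date())  # 2023-02-01
```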

&lt;h2&gt;
  
  
  Computers are simple
&lt;/h2&gt;

&lt;p&gt;They are and they have a very elegant way of representing time. &lt;/p&gt;

&lt;p&gt;The most common way to do it is by using an integer that represents the time elapsed since a specific starting date.&lt;/p&gt;

&lt;p&gt;The most common example of this approach is the &lt;em&gt;Unix Time Stamp&lt;/em&gt;, which tracks time as a running total of seconds. The count starts at the &lt;em&gt;Unix Epoch&lt;/em&gt;: January 1st, 1970, in UTC.&lt;/p&gt;

&lt;p&gt;You might wonder how we count time before that. Well, computers can represent negative numbers, right?&lt;/p&gt;

&lt;p&gt;Assuming all timestamps are in UTC, this is a very elegant way of representing dates.&lt;/p&gt;

&lt;p&gt;Applying operations on dates becomes as easy as applying operations on integers, e.g. addition and subtraction.&lt;/p&gt;
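&lt;p&gt;Here is a quick Python sketch of the idea, using only the standard library; the example date is arbitrary:&lt;/p&gt;

```python
from datetime import datetime, timezone

# A Unix timestamp is just the number of seconds elapsed since
# the Unix Epoch, 1970-01-01 00:00:00 UTC.
ts = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
print(ts)  # 946684800

# Date arithmetic becomes plain integer arithmetic: adding one
# day is adding 86400 seconds.
next_day = datetime.fromtimestamp(ts + 86400, tz=timezone.utc)
print(next_day.date())  # 2000-01-02
```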

&lt;p&gt;Of course, it wouldn’t be fun if everything were perfect.&lt;/p&gt;

&lt;p&gt;So, we still have to ensure that timestamps are aligned on the timezone, but this is relatively easy: we just have to apply a specific offset to convert from any timezone to UTC, or to any other timezone.&lt;/p&gt;

&lt;p&gt;Humans have agreed on something called a &lt;a href="https://en.wikipedia.org/wiki/UTC_offset" rel="noopener noreferrer"&gt;UTC offset&lt;/a&gt;, where the timezone also carries its offset from UTC.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PT is the Pacific Time Zone. We have PST (Pacific Standard Time) and PDT (Pacific Daylight Time) with:

PST: UTC-08:00
PDT: UTC-07:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see from the above, we can easily convert a PST time to UTC by adding 8 hours, and a PDT time by adding 7 hours.&lt;/p&gt;
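&lt;p&gt;A short Python sketch of the same offset arithmetic, with a made-up example time:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # PST is UTC-08:00

# 13:07 in PST is 21:07 in UTC: we add back the 8-hour offset.
local = datetime(2023, 2, 28, 13, 7, 13, tzinfo=PST)
as_utc = local.astimezone(timezone.utc)
print(as_utc)  # 2023-02-28 21:07:13+00:00
```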

&lt;p&gt;As long as there’s syntactic information about the timezone in the date, it’s easy to work with: we just need an index of all the timezone offsets and the timezones they correspond to.&lt;/p&gt;

&lt;p&gt;But as you can see, even working with just the timezone information can get tricky. What if there’s no explicit timezone information in the date we got?&lt;/p&gt;

&lt;p&gt;Computers are simple, as we said, which means they don’t assume things. Humans, on the other hand, love implicit information.&lt;/p&gt;

&lt;p&gt;And again, this is where things get hard with dates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serializations are not that simple though
&lt;/h2&gt;

&lt;p&gt;Let's get a bit more practical now and consider how information is exchanged between machines on the Internet.&lt;/p&gt;

&lt;p&gt;We will consider the following formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;CSV&lt;/li&gt;
&lt;li&gt;Protocol Buffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt; is the de facto format for exchanging data between web services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt; is probably the oldest human-readable format and, despite its age and issues, it’s still around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol Buffers&lt;/strong&gt; is the de facto format for gRPC and related microservices.&lt;/p&gt;

&lt;p&gt;For the above reasons, we can assume that anyone working in tech will have to work with them sooner or later. &lt;/p&gt;

&lt;p&gt;(You can't escape!)&lt;/p&gt;

&lt;p&gt;Let's see how each format represents date and time information.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON
&lt;/h3&gt;

&lt;p&gt;How does JSON represent dates and times? Is there a datatype used?&lt;/p&gt;

&lt;p&gt;The answer is: It doesn't handle dates and there's no datatype similar to what SQL has. &lt;/p&gt;

&lt;p&gt;Dates and times are strings, or integers if someone decides to store them as timestamps, but there’s no way to tell by validating the JSON document whether a key-value pair represents a date or not. You need a human for that.&lt;/p&gt;
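&lt;p&gt;A small Python example of what this means in practice; the field names are made up, and treating &lt;code&gt;ts&lt;/code&gt; as an RFC 3339 timestamp is knowledge that lives outside the JSON itself:&lt;/p&gt;

```python
import json
from datetime import datetime

doc = json.loads('{"event": "click", "ts": "2023-02-28T21:07:13+00:00"}')

# The parser hands back a plain string; nothing in JSON marks
# the "ts" value as a date.
print(type(doc["ts"]).__name__)  # str

# Turning it into a real datetime needs a human-supplied convention.
parsed = datetime.fromisoformat(doc["ts"])
print(parsed)  # 2023-02-28 21:07:13+00:00
```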

&lt;h3&gt;
  
  
  CSV
&lt;/h3&gt;

&lt;p&gt;CSV also does not have a date or time type. Actually it doesn't have types at all. &lt;/p&gt;

&lt;p&gt;At least with JSON a numerical field will always have a number. &lt;/p&gt;

&lt;p&gt;In CSV you can have a column whose fields hold mixed values, and there’s no way to guarantee that this will not happen.&lt;/p&gt;
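&lt;p&gt;The same point in Python, using the standard library’s &lt;code&gt;csv&lt;/code&gt; module on a made-up two-line file:&lt;/p&gt;

```python
import csv
import io

data = "user_id,Event,Timestamp\n1,click,2023-02-28T21:07:13+00:00\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as a string, even user_id; any typing
# is left entirely to whoever consumes the file.
print(type(rows[0]["user_id"]).__name__)  # str
print(rows[0]["Timestamp"])
```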

&lt;h3&gt;
  
  
  Protocol Buffers
&lt;/h3&gt;

&lt;p&gt;Protobuf has support for a &lt;a href="https://protobuf.dev/reference/protobuf/google.protobuf/#timestamp" rel="noopener noreferrer"&gt;timestamp type&lt;/a&gt;: seconds and fractions of a second at nanosecond resolution, counted in UTC from the epoch (remember the Unix Epoch?).&lt;/p&gt;

&lt;p&gt;Obviously, the support Protocol Buffers has for representing time is much more limited than something like SQL, and for a good reason.&lt;/p&gt;

&lt;p&gt;Protocol Buffers cares only about machine-to-machine communication; its design is not driven by the need to render dates and times to end users in every possible way.&lt;/p&gt;

&lt;p&gt;Also, by limiting the ways that time can be represented, they can perform a lot of optimizations on how the information is serialized on the wire. &lt;/p&gt;

&lt;p&gt;Something that is super important for protocols like that.&lt;/p&gt;
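&lt;p&gt;For intuition, the shape of protobuf’s &lt;code&gt;google.protobuf.Timestamp&lt;/code&gt; message can be sketched as a plain Python dataclass (the class and its helper are mine for illustration; real protobuf code is generated by &lt;code&gt;protoc&lt;/code&gt;):&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Timestamp:
    # Same shape as google.protobuf.Timestamp: whole seconds since
    # the Unix epoch plus a nanosecond remainder, always in UTC.
    seconds: int
    nanos: int

    def to_datetime(self):
        # Python datetimes only resolve to microseconds, so the
        # nanosecond remainder is truncated here.
        return datetime.fromtimestamp(
            self.seconds + self.nanos / 1_000_000_000, tz=timezone.utc
        )

ts = Timestamp(seconds=946684800, nanos=500_000_000)
print(ts.to_datetime())  # 2000-01-01 00:00:00.500000+00:00
```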

&lt;h3&gt;
  
  
  So what did we learn here?
&lt;/h3&gt;

&lt;p&gt;Support for date types varies from protocol to protocol and depending on where the data comes from, an engineer will pretty much have to deal with every possible scenario. &lt;/p&gt;

&lt;p&gt;So, how can we work with dates and time without shooting ourselves in the foot?&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Examples with DuckDB
&lt;/h2&gt;

&lt;p&gt;Let's see some examples using &lt;a href="https://duckdb.org" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Consider the following CSV data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:07:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:08:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:09:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:10:13+00:00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The above timestamp is represented in &lt;a href="https://www.rfc-editor.org/rfc/rfc3339" rel="noopener noreferrer"&gt;RFC3339&lt;/a&gt; format.&lt;/p&gt;

&lt;p&gt;Let's see how we can parse this into SQL types using DuckDB. &lt;/p&gt;

&lt;p&gt;DuckDB has great &lt;a href="https://duckdb.org/docs/data/csv" rel="noopener noreferrer"&gt;CSV parsing support&lt;/a&gt;. Assuming our csv file is named &lt;code&gt;events.csv&lt;/code&gt; we execute the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_csv_auto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events.csv'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we get the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xntfy5mfbfan1lyhtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6xntfy5mfbfan1lyhtq.png" alt="duckdb-p0" width="712" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is amazing is that DuckDB managed to guess the timestamp type and import the column directly as a timestamp!&lt;/p&gt;

&lt;p&gt;That’s partly because of the amazing work the DuckDB folks have done in delivering the best possible experience, but also because we chose a standard like RFC 3339 for representing dates.&lt;/p&gt;

&lt;p&gt;Now we can move on and keep working with our data. &lt;/p&gt;

&lt;p&gt;Now let’s see how powerful DuckDB is at inferring date formats. To do that, let’s change the date from &lt;code&gt;2023-02-28&lt;/code&gt; to &lt;code&gt;2023-02-02&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason we want to try this is that in the first case it’s easier to infer the format: since there are only 12 months, the 28 must be a day.&lt;/p&gt;

&lt;p&gt;Interestingly enough, DuckDB still infers the date format and creates a timestamp column! &lt;/p&gt;

&lt;p&gt;Just try it using the same code as previously and you will see.&lt;/p&gt;

&lt;p&gt;Finally, if we completely remove the time from the data, DuckDB will infer and use the data type &lt;code&gt;Date&lt;/code&gt; instead of &lt;code&gt;Datetime&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;So, DuckDB is doing a great job in helping us work with date time data! &lt;/p&gt;

&lt;p&gt;Let's update our previous table and add a column that has a Unix Timestamp too.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;th&gt;unix_time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:07:13+00:00&lt;/td&gt;
&lt;td&gt;1677647233&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:08:13+00:00&lt;/td&gt;
&lt;td&gt;1677647293&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;load_page&lt;/td&gt;
&lt;td&gt;2023-02-28T21:09:13+00:00&lt;/td&gt;
&lt;td&gt;1677647353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;click&lt;/td&gt;
&lt;td&gt;2023-02-28T21:10:13+00:00&lt;/td&gt;
&lt;td&gt;1677647413&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;view&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_csv_auto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events_time.csv'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the results we get are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj28da71zqw7etyngrnp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj28da71zqw7etyngrnp4.png" alt="duckdb-dates-1" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, DuckDB doesn't know that &lt;code&gt;unix_time&lt;/code&gt; is a timestamp; it registers the column's data type as int64, which is what we should expect.&lt;/p&gt;

&lt;p&gt;Often we will be handed only Unix timestamps and not dates. How do we deal with that?&lt;/p&gt;

&lt;p&gt;We just need to let DuckDB know what kind of data we are dealing with. &lt;/p&gt;

&lt;p&gt;In this case we know we are dealing with Unix timestamps, and &lt;a href="https://duckdb.org/docs/sql/functions/timestamp" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; has a number of utility functions to help us; more specifically, we will look into the &lt;code&gt;epoch_ms&lt;/code&gt; function. &lt;/p&gt;

&lt;p&gt;Let's execute the following SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result we are getting is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y66je0uh6akj1ebyz2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y66je0uh6akj1ebyz2m.png" alt="duckdb-dates-2" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is obviously wrong: the &lt;code&gt;timestamp&lt;/code&gt; and &lt;code&gt;unix_time&lt;/code&gt; columns should match, and there's a huge difference between events happening in 2023 and in 1970.&lt;/p&gt;

&lt;p&gt;What went wrong? Remember the Unix timestamp definition: it's measured in &lt;strong&gt;seconds&lt;/strong&gt; from the Unix Epoch, but our function is &lt;code&gt;epoch_ms&lt;/code&gt;, which expects &lt;strong&gt;milliseconds&lt;/strong&gt;. &lt;/p&gt;
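&lt;p&gt;You can reproduce the seconds-vs-milliseconds mixup outside DuckDB too. Here is a quick sketch in plain Python (standard library only, with an illustrative epoch value) showing that feeding a seconds-based Unix timestamp to a milliseconds-based conversion lands you back near 1970:&lt;/p&gt;

```python
from datetime import datetime, timezone

unix_seconds = 1677618433  # 2023-02-28T21:07:13 UTC, as seconds since the epoch

# Correct: interpret the value as seconds since the Unix epoch.
as_seconds = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Wrong: a milliseconds-based API effectively divides by 1000 first,
# which is what happened with epoch_ms above.
as_if_milliseconds = datetime.fromtimestamp(unix_seconds / 1000, tz=timezone.utc)

print(as_seconds)          # 2023-02-28 21:07:13+00:00
print(as_if_milliseconds)  # a date in January 1970
```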

&lt;p&gt;Let's try again with a slightly different query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the results look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu594x4ctfkjed8ogs0et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu594x4ctfkjed8ogs0et.png" alt="duckdb-dates-3" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Better, but it still doesn't look right. What's going on here? Let's try something to see if we can figure out the issue. Consider the following query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;epoch_ms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unix_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;events_time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9edxrf8wh2tqepwx8cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9edxrf8wh2tqepwx8cp.png" alt="duckdb-dates-4" width="770" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's something interesting here: the difference between the two timestamp columns is constant, and it happens to equal the offset between the timezone I live in (Pacific Time) and UTC.&lt;/p&gt;

&lt;p&gt;So, the discrepancy between the two columns comes from the different timezones used. If we want to be consistent, we should take the timezone into account in both cases.&lt;/p&gt;
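&lt;p&gt;To make the timezone effect concrete, here is a small Python sketch (standard library only; Pacific is modeled as a fixed -08:00 offset for simplicity): the same wall-clock reading, anchored to two different zones, denotes instants exactly eight hours apart.&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

wall_clock = datetime(2023, 2, 28, 21, 7, 13)  # naive "2023-02-28T21:07:13"

# The same reading anchored to UTC vs. to a fixed -08:00 (PST) offset.
as_utc = wall_clock.replace(tzinfo=timezone.utc)
as_pacific = wall_clock.replace(tzinfo=timezone(timedelta(hours=-8)))

# Two different instants, exactly eight hours apart.
print(as_pacific - as_utc)  # 8:00:00
```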

&lt;p&gt;&lt;em&gt;Consistent semantics around time is one of the most important best practices when dealing with time in databases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DuckDB has recently added support for working with &lt;a href="https://duckdb.org/docs/extensions/json.html" rel="noopener noreferrer"&gt;JSON documents too&lt;/a&gt;. You should give it a try and repeat the above examples using JSON instead of CSV and see what the differences might be, if any.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Consistent Semantics
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Consistency is key!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keep it consistent: that's the best advice when it comes to working with dates. Stick with one format and make sure everyone follows it.&lt;/p&gt;

&lt;p&gt;Now, that's easier said than done; we will always have bugs and make errors, even when there's strict consensus on how to represent something. &lt;/p&gt;

&lt;h3&gt;
  
  
  First load, then deal with the data
&lt;/h3&gt;

&lt;p&gt;First load your data into your database and then try to figure out if issues exist or not and how to deal with them.&lt;/p&gt;

&lt;p&gt;That doesn't mean you should load the data into production tables, though. On the contrary: load the data into temporary dev tables first, and merge with production once you are sure of the data format.&lt;/p&gt;

&lt;p&gt;Why do that? Because the last thing you want is to fail a whole pipeline because one date is malformed. To summarize: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Separate ingestion and loading logic on your data pipelines. First load into a temp table, then validate and then merge with production tables.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ingestion takes time, especially with big workloads, and much of it goes into IO and converting from one format to another. You don't want to waste those resources on syntactic issues that can be fixed upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate Storage from Presentation
&lt;/h3&gt;

&lt;p&gt;Just because you will be sharing date and time information with humans doesn't mean you should store the data in the format a human expects.&lt;/p&gt;

&lt;p&gt;Instead, stick to timestamps in a standard timezone (most probably UTC), and whenever data has to be presented to the user, let the client figure out the best way to present it. &lt;/p&gt;

&lt;p&gt;By doing that, you ensure clear semantics across all your storage and at the same time maximize flexibility in terms of presenting the information to the consumer.&lt;/p&gt;
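&lt;p&gt;A minimal Python sketch of the idea (the zone and format strings here are illustrative): store one unambiguous UTC instant, and let each client format it at display time.&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

# Storage: one unambiguous instant, always in UTC.
stored = datetime(2023, 2, 28, 21, 7, 13, tzinfo=timezone.utc)

# Presentation: each client converts and formats for its own audience.
pacific = timezone(timedelta(hours=-8), name="PST")  # fixed offset for the sketch
display = stored.astimezone(pacific).strftime("%b %d, %Y %I:%M %p %Z")
print(display)  # Feb 28, 2023 01:07 PM PST
```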

&lt;h3&gt;
  
  
  Learn to trust your database
&lt;/h3&gt;

&lt;p&gt;Don't try to re-invent the wheel. Whatever issue you have with your dates, someone has almost certainly dealt with the same one in the past. &lt;/p&gt;

&lt;p&gt;This knowledge has been integrated into query engines and databases and there are plenty of tools and best practices to follow. &lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links &amp;amp; References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.unixtimestamp.com" rel="noopener noreferrer"&gt;Online Unix Timestamp Converter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/date" rel="noopener noreferrer"&gt;DuckDB Date Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/sql/functions/timestamp" rel="noopener noreferrer"&gt;DuckDB Timestamp Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/data/overview" rel="noopener noreferrer"&gt;DuckDB Data Import documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org" rel="noopener noreferrer"&gt;DuckDB Client Download&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>sql</category>
      <category>tutorial</category>
      <category>database</category>
    </item>
    <item>
      <title>Introduction to SQL Timestamp, Date and Time Data Types.</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Tue, 21 Feb 2023 20:49:47 +0000</pubDate>
      <link>https://forem.com/cpard/working-with-sql-timestamp-date-and-time-data-types-3g17</link>
      <guid>https://forem.com/cpard/working-with-sql-timestamp-date-and-time-data-types-3g17</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Working with Time in SQL is one of the most common tasks that people are seeking help for. &lt;br&gt;
In this article I will cover the most common ways to work with time in SQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  SQL Timestamp, Date and Time Data Types
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt" rel="noopener noreferrer"&gt;SQL standard&lt;/a&gt; defines the datetime datatype&lt;br&gt;
which can be further specified using the following descriptors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DATE&lt;/li&gt;
&lt;li&gt;TIME&lt;/li&gt;
&lt;li&gt;TIMESTAMP&lt;/li&gt;
&lt;li&gt;TIME WITH TIME ZONE&lt;/li&gt;
&lt;li&gt;TIMESTAMP WITH TIME ZONE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together with the value of the time fractional seconds precision if the descriptor is one of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TIME&lt;/li&gt;
&lt;li&gt;TIMESTAMP&lt;/li&gt;
&lt;li&gt;TIME WITH TIME ZONE&lt;/li&gt;
&lt;li&gt;TIMESTAMP WITH TIME ZONE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL also defines intervals in a similar way, but there's only one possible type descriptor in this case: INTERVAL.&lt;/p&gt;

&lt;p&gt;Together with an indication of whether the interval is a year-month or a day-time interval, and finally the interval qualifier that describes&lt;br&gt;
the precision of the interval data type.&lt;/p&gt;
&lt;h2&gt;
  
  
  Things to Remember
&lt;/h2&gt;

&lt;p&gt;👉 Datetimes only have absolute meaning in the context of additional information. &lt;/p&gt;

&lt;p&gt;So, unless the time zone specifier, and its meaning, is known, the meaning of a datetime value is ambiguous.&lt;/p&gt;

&lt;p&gt;Therefore, datetime data types that contain time fields (TIME and TIMESTAMP) are maintained in Universal Coordinated Time (UTC), with an explicit or implied time zone part.&lt;/p&gt;

&lt;p&gt;👉 The time zone part is an interval specifying the difference between&lt;br&gt;
         UTC and the actual date and time in the time zone represented by&lt;br&gt;
         the time or timestamp data item.&lt;/p&gt;

&lt;p&gt;👉 Items of type datetime are mutually comparable only if they have&lt;br&gt;
         the same datetime fields.&lt;/p&gt;

&lt;p&gt;👉 There is an ordering of the significance of datetime fields. This&lt;br&gt;
         is, from most significant to least significant: YEAR, MONTH, DAY,&lt;br&gt;
         HOUR, MINUTE, and SECOND.&lt;/p&gt;

&lt;p&gt;OK, that's all you need to know! (Joking, there's more.) But the above statements are helpful to keep in mind&lt;br&gt;
when working with time in SQL.&lt;/p&gt;
&lt;h2&gt;
  
  
  Operating on Datetime types
&lt;/h2&gt;

&lt;p&gt;The basic arithmetic operators are supported for Datetime type, although it's always a good idea&lt;br&gt;
to check the documentation of the database you are using to check the exact semantics implemented for the operator.&lt;/p&gt;

&lt;p&gt;The following operators are the most commonly found ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Addition (+)

&lt;ul&gt;
&lt;li&gt;date + integer&lt;/li&gt;
&lt;li&gt;date + interval&lt;/li&gt;
&lt;li&gt;date + time &lt;/li&gt;
&lt;li&gt;interval + interval &lt;/li&gt;
&lt;li&gt;timestamp + interval &lt;/li&gt;
&lt;li&gt;time + interval &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Subtraction (-)

&lt;ul&gt;
&lt;li&gt;-interval (Negation)&lt;/li&gt;
&lt;li&gt;date - date&lt;/li&gt;
&lt;li&gt;date - integer&lt;/li&gt;
&lt;li&gt;date - interval&lt;/li&gt;
&lt;li&gt;time - time &lt;/li&gt;
&lt;li&gt;time - interval &lt;/li&gt;
&lt;li&gt;timestamp - interval&lt;/li&gt;
&lt;li&gt;interval - interval&lt;/li&gt;
&lt;li&gt;timestamp - timestamp&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Multiplication (*)

&lt;ul&gt;
&lt;li&gt;interval * double precision&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Division (/)

&lt;ul&gt;
&lt;li&gt;interval / double precision
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, the most important thing is to check with the database documentation on the semantics of the above operations.&lt;/p&gt;

&lt;p&gt;For example, what should be the return type of each operation?&lt;/p&gt;
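&lt;p&gt;Python's standard library offers a reasonable mental model for these operators (a sketch, not any particular database's exact behavior): intervals map to timedelta, and the return type follows the operand types.&lt;/p&gt;

```python
from datetime import date, datetime, timedelta

# date - date yields an interval.
assert date(2023, 3, 1) - date(2023, 2, 28) == timedelta(days=1)

# timestamp + interval yields a timestamp.
ts = datetime(2023, 2, 28, 21, 7, 13)
assert ts + timedelta(minutes=1) == datetime(2023, 2, 28, 21, 8, 13)

# interval * number yields a scaled interval.
assert timedelta(hours=1) * 2.5 == timedelta(hours=2, minutes=30)
```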
&lt;h2&gt;
  
  
  Date and Time Functions
&lt;/h2&gt;

&lt;p&gt;There are some functions that are more important than others, or at least that you are going to need more often.&lt;br&gt;
These are:&lt;/p&gt;

&lt;p&gt;👉 First the popular &lt;code&gt;date_trunc(field, source)&lt;/code&gt; function, where source is a value expression of a datetime-related type and field is used &lt;br&gt;
to define the precision of the truncation. Some examples of valid field values from Postgres are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;microseconds&lt;/li&gt;
&lt;li&gt;milliseconds&lt;/li&gt;
&lt;li&gt;second&lt;/li&gt;
&lt;li&gt;minute&lt;/li&gt;
&lt;li&gt;hour&lt;/li&gt;
&lt;li&gt;day&lt;/li&gt;
&lt;li&gt;week&lt;/li&gt;
&lt;li&gt;month&lt;/li&gt;
&lt;li&gt;quarter&lt;/li&gt;
&lt;li&gt;year&lt;/li&gt;
&lt;li&gt;decade&lt;/li&gt;
&lt;li&gt;century&lt;/li&gt;
&lt;li&gt;millennium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As usual, check your documentation!&lt;/p&gt;
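&lt;p&gt;As a mental model for truncation, here is a plain-Python sketch (not any database's actual implementation): truncating to a field means zeroing out every less significant field.&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2023, 2, 28, 21, 7, 13)

# date_trunc('hour', ts): zero everything below the hour.
to_hour = ts.replace(minute=0, second=0, microsecond=0)

# date_trunc('month', ts): reset the day and the whole time part.
to_month = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

print(to_hour)   # 2023-02-28 21:00:00
print(to_month)  # 2023-02-01 00:00:00
```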

&lt;p&gt;👉 Then we have &lt;code&gt;now()&lt;/code&gt; which returns the current timestamp, usually as of the beginning of the query execution.&lt;br&gt;
There are also a number of other similar functions, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;current_date&lt;/li&gt;
&lt;li&gt;current_time&lt;/li&gt;
&lt;li&gt;current_timestamp&lt;/li&gt;
&lt;li&gt;current_time(precision)&lt;/li&gt;
&lt;li&gt;and many more...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 The &lt;code&gt;extract(field from source)&lt;/code&gt; function is used to retrieve subfields such as year or hour from date/time values. &lt;br&gt;
The extract function is primarily intended for computational processing.&lt;/p&gt;

&lt;p&gt;👉 The &lt;code&gt;date_part(field, source)&lt;/code&gt; function extracts a part of the date, timestamp, or interval.&lt;/p&gt;

&lt;p&gt;👉 &lt;code&gt;date_format(expr, fmt)&lt;/code&gt; is used to format a date time value using the format specified by fmt. &lt;/p&gt;

&lt;p&gt;Formatting is a bit of a complicated matter as there's a lot of flexibility on how a date can be formatted. &lt;/p&gt;

&lt;p&gt;There's a "language" that is used to define patterns for the formatting. This language is at least similar&lt;br&gt;
among the different databases and as I've said many times, you should check the documentation! &lt;/p&gt;

&lt;p&gt;👉 Finally, there are the casting expressions like &lt;code&gt;cast(expr as type)&lt;/code&gt; that are used to cast between different datetime types and others.&lt;br&gt;
For example casting a string into a date or timestamp or a datetime to timestamp etc. &lt;/p&gt;

&lt;p&gt;The above are some of the most commonly used functions and operators when working with time in SQL.&lt;br&gt;
Most database systems offer many more for convenience but the above should be enough to cover most of the use cases.&lt;/p&gt;
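&lt;p&gt;The same family of operations, sketched with Python's standard library for comparison (format pattern codes differ between databases; these are strftime codes):&lt;/p&gt;

```python
from datetime import datetime

ts = datetime(2023, 2, 28, 21, 7, 13)

# extract / date_part: pull a subfield out of the value.
year = ts.year  # 2023

# date_format: render with an explicit pattern.
formatted = ts.strftime("%Y-%m-%d %H:%M")  # '2023-02-28 21:07'

# cast(string as timestamp): parse with an explicit pattern.
parsed = datetime.strptime("2023-02-28T21:07:13", "%Y-%m-%dT%H:%M:%S")
assert parsed == ts
```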
&lt;h2&gt;
  
  
  Windows
&lt;/h2&gt;

&lt;p&gt;This is where things start getting more complicated when working with dates and times in SQL.&lt;/p&gt;

&lt;p&gt;Windows are important as they allow you to define partitions over your data and perform aggregations over them.&lt;/p&gt;

&lt;p&gt;Consider for example the case of calculating Monthly Recurring Revenue (MRR). To do that, you first &lt;br&gt;
need to partition your data by month and then calculate the sum of the revenue for each month.&lt;/p&gt;

&lt;p&gt;The most common way of defining a window is using the &lt;code&gt;OVER&lt;/code&gt; clause. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mrr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query will calculate the MRR for each month, and it's a good example of how powerful the combination&lt;br&gt;
of windows and datetime functions is.&lt;/p&gt;
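&lt;p&gt;If you want to play with this locally without a warehouse, SQLite (version 3.25+, bundled with Python's standard library) supports window functions too. Note that date_trunc doesn't exist there, so this sketch stands in with strftime('%Y-%m', ...); the table and values are made up:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (date TEXT, revenue INTEGER)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2023-01-10", 100), ("2023-01-25", 50), ("2023-02-05", 200)],
)

# date_trunc('month', date) becomes strftime('%Y-%m', date) in SQLite.
rows = con.execute("""
    SELECT strftime('%Y-%m', date) AS month,
           SUM(revenue) OVER (PARTITION BY strftime('%Y-%m', date)) AS mrr
    FROM transactions
    ORDER BY month
""").fetchall()
print(rows)  # [('2023-01', 150), ('2023-01', 150), ('2023-02', 200)]
```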

&lt;p&gt;There are other types of window functions available, but they can be grouped into two main categories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking windows&lt;/li&gt;
&lt;li&gt;Aggregate (or analytic) windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common example of a ranking window function is &lt;code&gt;rank()&lt;/code&gt; which returns the rank of a value compared to all values in the partition.&lt;/p&gt;

&lt;p&gt;A common example of an analytic window function is &lt;code&gt;lag(expr)&lt;/code&gt;, which returns the value of expr from a preceding row within the partition.&lt;/p&gt;

&lt;p&gt;Again, you will find examples and more details in the documentation of your database system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Working with datetime in SQL can easily get complicated, especially when we start working with &lt;br&gt;
formatting and windows. &lt;/p&gt;

&lt;p&gt;But the most common tasks associated with time in SQL are not that complicated.&lt;br&gt;
There's a small number of data types and a large number of helper functions that can help a lot.&lt;/p&gt;

&lt;p&gt;Below you can find a list of links to the documentation of some well known OLAP and OLTP databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#date-timestamp-and-interval-functions" rel="noopener noreferrer"&gt;Databricks - Spark Date-Time &amp;amp; Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#ranking-window-functions" rel="noopener noreferrer"&gt;Databricks - Spark Windows (ranking)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#analytic-window-functions" rel="noopener noreferrer"&gt;Databricks - Spark Windows (analytics)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/sql-reference/data-types-datetime.html" rel="noopener noreferrer"&gt;Snowflake Types&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/functions-window-using.html" rel="noopener noreferrer"&gt;Snowflake Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT" rel="noopener noreferrer"&gt;PostgreSQL Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/15/tutorial-window.html" rel="noopener noreferrer"&gt;PostgreSQL Windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions/" rel="noopener noreferrer"&gt;Clickhouse Functions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/time-window-functions/" rel="noopener noreferrer"&gt;Clickhouse Windows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Adding a new data type to a sql query engine (Trino)</title>
      <dc:creator>Kostas Pardalis</dc:creator>
      <pubDate>Mon, 06 Feb 2023 04:00:18 +0000</pubDate>
      <link>https://forem.com/cpard/adding-a-new-data-type-to-a-sql-query-engine-trino-5f85</link>
      <guid>https://forem.com/cpard/adding-a-new-data-type-to-a-sql-query-engine-trino-5f85</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post I want to take you through a journey that starts with &lt;a href="https://github.com/trinodb/trino/issues/1284" rel="noopener noreferrer"&gt;Github Issue #1284&lt;/a&gt; requesting support for nanosecond/microsecond precision in TIMESTAMP for Trino and ends with the merge of &lt;a href="https://github.com/trinodb/trino/pull/3783" rel="noopener noreferrer"&gt;Github PR #3783&lt;/a&gt; that added support for the parametric TIMESTAMP type to the Trino query engine.&lt;/p&gt;

&lt;p&gt;This journey includes a number of surprises too!&lt;/p&gt;

&lt;p&gt;For example, the identification of issues with the semantics of the TIMESTAMP type in Trino, explained in a lot of detail in &lt;a href="https://github.com/trinodb/trino/issues/37" rel="noopener noreferrer"&gt;Github Issue #37&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our goal is to go deep into some important dimensions of a SQL query engine, including the type system and data encoding.&lt;/p&gt;

&lt;p&gt;But also get a taste of what it takes to engineer a complex system like a distributed SQL query engine that is being used daily by thousands of users.&lt;/p&gt;

&lt;p&gt;This is going to be a long post, so buckle up!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When working with time in a digital system like a computer, precision is one of the things that we deeply care about. Especially when time is important for the tasks we want to perform.&lt;/p&gt;

&lt;p&gt;Some industries, like finance, are more sensitive to time&lt;br&gt;
measurements than others, but in any case we want to understand well the semantics of the time data types we work with.&lt;/p&gt;

&lt;p&gt;In 2020 FINRA, the Financial Industry Regulatory Authority, submitted a proposal to the SEC to change the rules relating to the granularity of timestamps in trade reports.&lt;/p&gt;

&lt;p&gt;The proposal suggests tracking trades at nanosecond time granularity.&lt;/p&gt;

&lt;p&gt;A system that supports nanosecond timestamp precision,&lt;br&gt;
can work with timestamps in millisecond precision without losing any information. The opposite is not true though.&lt;/p&gt;

&lt;p&gt;Trino can support up to &lt;a href="https://trino.io/docs/current/language/types.html#date-and-time" rel="noopener noreferrer"&gt;picosecond timestamp precision&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Trino is a technology used heavily by the financial sector.&lt;br&gt;
This is one of the &lt;a href="https://github.com/trinodb/trino/issues/1284" rel="noopener noreferrer"&gt;reasons&lt;/a&gt; the Trino community started looking into updating the TIMESTAMP type in Trino.&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL Data Types
&lt;/h2&gt;

&lt;p&gt;A data type is a set of representable values and the physical representation of any of the values, is implementation-dependent.&lt;/p&gt;

&lt;p&gt;The value of a Timestamp might be 1663802507 but how it is physically represented is left to the database engineer to decide.&lt;/p&gt;

&lt;p&gt;Usually, Datetime types (including Timestamp) are physically implemented using 32/64-bit integers.&lt;/p&gt;

&lt;p&gt;The SQL Specification allows for arbitrary precision of Datetime types and does that by stating the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;...SECOND, however,&lt;br&gt;
can be defined to have a &amp;lt;time fractional seconds precision&amp;gt; that indicates the number of decimal digits maintained&lt;br&gt;
following the decimal point in the seconds value, a non-negative exact numeric value.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Parameterized Data Types
&lt;/h3&gt;

&lt;p&gt;If you are familiar with SQL you will easily recognize the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;orderkey&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above will create a table with a column named orderkey whose type will be bigint. There's no way to parameterize bigint.&lt;/p&gt;

&lt;p&gt;This is true for the majority of types, but there are a couple of exceptions, with timestamps being one of them.&lt;/p&gt;

&lt;p&gt;For reasons that we will investigate later, it does make a difference when you are dealing with a timestamp in milliseconds versus picoseconds.&lt;/p&gt;

&lt;p&gt;We'd like to allow the user to define the granularity and take advantage of that in optimizing how the data is manipulated.&lt;/p&gt;

&lt;p&gt;In Trino the Timestamp type looks like Timestamp(p),&lt;br&gt;
where p is the number of digits of precision for the fraction of seconds.&lt;/p&gt;

&lt;p&gt;The following SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;orderkey&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;creationdate&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will create a table orders with a column named creationdate of type timestamp, with 6 digits of precision for the fraction of seconds.&lt;/p&gt;

&lt;h1&gt;
  
  
  Adding a new Data type
&lt;/h1&gt;

&lt;p&gt;Adding a new type, or in this case updating an existing one with new semantics, is a risky task.&lt;/p&gt;

&lt;p&gt;We are dealing with a system that is used by thousands of users daily, and any change to a type might end up breaking literally thousands of SQL queries.&lt;/p&gt;

&lt;p&gt;Let's see some of the considerations that the Trino team had during the design phase and then we will elaborate more on &lt;br&gt;
each one of them.&lt;/p&gt;

&lt;p&gt;First we have performance considerations.&lt;br&gt;
How can we make sure that we deliver the best possible performance when dealing with timestamps?&lt;br&gt;
The trick here is to consider different representations and functions based on the precision the user has defined.&lt;/p&gt;

&lt;p&gt;There's the major issue of backward compatibility.&lt;/p&gt;

&lt;p&gt;How do we ensure that the Timestamp semantics of the current implementation will not break with the introduction of parameterization?&lt;/p&gt;

&lt;p&gt;This compatibility is not just for the type itself but also for all the functions that are related to this type.&lt;/p&gt;

&lt;p&gt;Then we have the added complexity of Trino being a federated query engine.&lt;/p&gt;

&lt;p&gt;We need to make sure that the new Timestamp type can be correctly mapped to the equivalent types of each connector supported and do that in the most performant way possible.&lt;/p&gt;

&lt;p&gt;Finally, we need to make sure that we handle correctly data type conversions between different types and precisions.&lt;/p&gt;

&lt;p&gt;Each one of the above considerations is a huge fascinating topic on its own, so let's dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  From Logical to Physical representation
&lt;/h3&gt;

&lt;p&gt;We physically implement timestamps using integers.&lt;br&gt;
Obviously a 32-bit integer can represent more timestamps than an 8-bit one.&lt;/p&gt;

&lt;p&gt;So how many bits do we need for picosecond precision?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/wiki/Variable-precision-datetime-types" rel="noopener noreferrer"&gt;The answer can be found in the design document for the variable precision datetime types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;p=12&lt;/code&gt; we need 79 bits, which is a bit more than the 64 bits that the &lt;a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html" rel="noopener noreferrer"&gt;Java long type supports&lt;/a&gt;. To represent timestamps with resolution higher than &lt;code&gt;p=6&lt;/code&gt;, we will need to come up with our own Java representation.&lt;/p&gt;
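&lt;p&gt;As a sanity check, here is a back-of-the-envelope sketch of my own (not Trino code) that estimates how many bits a counter of fractional-second units needs at precision &lt;code&gt;p&lt;/code&gt;, assuming, for illustration, roughly 10,000 years of representable range:&lt;/p&gt;

```java
// Back-of-the-envelope sketch (not Trino code): bits needed to count
// fractional-second units at precision p over an assumed 10,000-year range.
public class PrecisionBits {
    static int bitsNeeded(int p) {
        double seconds = 10_000.0 * 365.25 * 24 * 3600;  // assumed range in seconds
        double units = seconds * Math.pow(10, p);        // units of 10^-p seconds
        return (int) Math.ceil(Math.log(units) / Math.log(2));
    }

    public static void main(String[] args) {
        for (int p = 0; p != 13; p++) {
            System.out.println("p=" + p + " needs about " + bitsNeeded(p) + " bits");
        }
    }
}
```

&lt;p&gt;Under this assumption, &lt;code&gt;p=6&lt;/code&gt; needs about 59 bits and fits comfortably in a Java long, while &lt;code&gt;p=12&lt;/code&gt; needs about 79 bits, which is why a single 64-bit word is no longer enough.&lt;/p&gt;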

&lt;p&gt;What Trino does is the following:&lt;/p&gt;

&lt;p&gt;We will have one short encoding for any timestamp precision that can be represented by a Java long type. This covers &lt;code&gt;p&amp;lt;=6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will also introduce a long encoding for any timestamp precision &lt;code&gt;p &amp;gt; 6&lt;/code&gt; and up to the maximum we want to support, which in this case is 12. &lt;/p&gt;

&lt;p&gt;The fractional part will be 16 or 32 bits, depending on what precision we want to support in the end.&lt;/p&gt;
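&lt;p&gt;Put together, the two encodings can be sketched like this (illustrative names and shapes of my own, not Trino's actual classes):&lt;/p&gt;

```java
// Illustrative sketch of the two physical layouts (names are mine, not Trino's):
// a short timestamp is a single Java long, while a long timestamp pairs a
// 64-bit whole part with a 32-bit fractional remainder, 64 + 32 = 96 bits.
public class TimestampLayouts {
    // p up to 6: the whole value fits in one 64-bit long
    static long shortTimestamp(long epochMicros) {
        return epochMicros;
    }

    // p from 7 to 12: a long for the whole part plus an int for the extra digits
    static long[] longTimestamp(long epochMicros, int picosOfMicro) {
        return new long[] { epochMicros, picosOfMicro };
    }

    public static void main(String[] args) {
        long[] t = longTimestamp(1_700_000_000_000_000L, 123_456);
        System.out.println(t[0] + " micros plus " + t[1] + " picos");
    }
}
```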

&lt;p&gt;How does this look when implemented?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/master/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestamp.java" rel="noopener noreferrer"&gt;Let's see the current Trino implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LongTimestamp class implements the type described above: one Java long holds the whole part of the timestamp, and one Java int (a 32-bit type) represents the remaining fractional part of a second, with precision up to 32 bits.&lt;/p&gt;

&lt;p&gt;There's also an implementation for a Short Timestamp.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;Let's see how this looks in Trino&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We want the Short Timestamp to be represented by a 64-bit integer, and to do that, we explicitly associate this class with the Java &lt;code&gt;long.class&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Most of the implementation has been omitted to make it easy to understand how Short Timestamp is represented, but the writeLong method is a good place to see how the type is assumed to be a long.&lt;/p&gt;

&lt;p&gt;You might be wondering why we make our lives so hard instead of simplifying things with a single physical Timestamp representation.&lt;/p&gt;

&lt;p&gt;The answer is performance. Trino is one of the most performant SQL query engines in the industry, and this wouldn't have happened if we didn't squeeze out every little bit of performance we could.&lt;/p&gt;

&lt;p&gt;When we talk about performance, there are two kinds we care about. One is storage performance, i.e. how we can reduce the space required to store information.&lt;/p&gt;

&lt;p&gt;The other one is processing time performance, or how we can make things run faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saving Space &amp;amp; Time
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestampType.java" rel="noopener noreferrer"&gt;If we take a look at the LongTimestampType implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trino uses a BlockBuilder to write and read data for the specific type.&lt;br&gt;
What is interesting in that code is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The size of the type is used as information passed to the BlockBuilder&lt;/li&gt;
&lt;li&gt;There's a special BlockBuilder implementation for Int96; as we concluded earlier, we need 64 + 32 = 96 bits to represent our Long Timestamps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we take a look at the &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;ShortTimestampType implementation&lt;/a&gt;, we'll notice that the BlockBuilder used is different.&lt;/p&gt;

&lt;p&gt;The API used is the same, but the BlockBuilder is different: now we use a LongArrayBlockBuilder instead of an Int96ArrayBlockBuilder.&lt;/p&gt;

&lt;p&gt;The Block part of the Trino SPI is a very important part of the query engine, as it is responsible for the efficient encoding of the data being processed.&lt;/p&gt;

&lt;p&gt;Any decision in the Block API can greatly affect the performance of the engine.&lt;/p&gt;

&lt;p&gt;Now you might wonder why I went through all these examples and why the Trino engineers went through the implementation of different block writers.&lt;/p&gt;

&lt;p&gt;The reason is simple: it has to do with the space efficiency of the data types.&lt;/p&gt;

&lt;p&gt;Remember that a ShortTimestampType is represented by a 64-bit Java long, while a LongTimestampType uses a 64-bit Java long plus a 32-bit Java int. Based on that, the memory required to store a timestamp of each type is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShortTimestampType: 8 bytes&lt;/li&gt;
&lt;li&gt;LongTimestampType: 16 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need twice the space for a timestamp with precision higher than microseconds.&lt;br&gt;
Keep in mind that if we use a LongTimestampType, we will use 16 bytes of memory even if we don't have picosecond precision.&lt;/p&gt;

&lt;p&gt;That's 2x the memory, and it's a lot when we are working with petabytes of data!&lt;/p&gt;

&lt;p&gt;Let's see now what happens with time complexity and if there's any performance difference for typical type operations like comparison between the two timestamp types.&lt;/p&gt;

&lt;p&gt;To do that, we will need the implementations of &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/ShortTimestampType.java" rel="noopener noreferrer"&gt;ShortTimestampType&lt;/a&gt; and &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/LongTimestampType.java" rel="noopener noreferrer"&gt;LongTimestampType&lt;/a&gt;; let's check the difference in the comparison operator.&lt;/p&gt;

&lt;p&gt;The ShortTimestampType comparison involves a single call to the Long.compare method.&lt;/p&gt;

&lt;p&gt;On the other hand, the LongTimestampType comparison may involve a second step, so in the worst case we have to compare one long and one int value.&lt;/p&gt;
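&lt;p&gt;The shape of the two code paths can be sketched like this (hypothetical names, simplified from the linked classes):&lt;/p&gt;

```java
// Sketch of the two comparison paths (hypothetical names, not Trino's code).
public class TimestampComparisons {
    // Short timestamps: a single 64-bit comparison.
    static int compareShort(long leftMicros, long rightMicros) {
        return Long.compare(leftMicros, rightMicros);
    }

    // Long timestamps: compare the 64-bit part first and fall back to the
    // 32-bit fractional part only on a tie, so the common case stays cheap.
    static int compareLong(long leftMicros, int leftPicos, long rightMicros, int rightPicos) {
        int result = Long.compare(leftMicros, rightMicros);
        if (result != 0) {
            return result;
        }
        return Integer.compare(leftPicos, rightPicos);
    }

    public static void main(String[] args) {
        // Equal micros, so the fractional parts decide the ordering.
        System.out.println(compareLong(10, 5, 10, 7));
    }
}
```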

&lt;p&gt;There are substantial performance gains, both in time and storage, from adopting this more complex, dual-encoding timestamp implementation.&lt;/p&gt;

&lt;p&gt;Decisions like this one, made in every part of the engine, are what make Trino such a performant query engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move Fast but Don't Break Things
&lt;/h2&gt;

&lt;p&gt;Trino is used by thousands of users on a daily basis, so whenever something new is introduced, it's super important to ensure that nothing will break.&lt;/p&gt;

&lt;p&gt;Backward compatibility is important and it's part of the design of any new feature.&lt;/p&gt;

&lt;p&gt;To ensure backward compatibility, it was decided to first tackle the language and the data types by maintaining the current semantics while adding parameterization.&lt;/p&gt;

&lt;p&gt;By current semantics we mean the precision that was supported by Trino at that time, which was &lt;code&gt;p=3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Updating the types alone is not enough, though; there are a number of special functions that have to be parameterized as well.&lt;/p&gt;

&lt;p&gt;These functions are &lt;code&gt;current_time&lt;/code&gt;, &lt;code&gt;localtime&lt;/code&gt;, &lt;code&gt;current_timestamp&lt;/code&gt; and &lt;code&gt;localtimestamp&lt;/code&gt;, where the user will be able to provide a precision parameter while the current semantics remain the defaults.&lt;/p&gt;

&lt;p&gt;Along with the functions mentioned above, all the connectors that accept the timestamp types in their DDL will have to be updated, together with all functions and operators that operate on these types.&lt;/p&gt;

&lt;p&gt;This is just the first step in building support for parameterized timestamp types, and its purpose is to introduce the appropriate changes without breaking anything for existing users, by enforcing the right defaults.&lt;/p&gt;

&lt;p&gt;We also need to ensure that there's a way to safely cast a Timestamp with precision &lt;code&gt;p1&lt;/code&gt; to a Timestamp with precision &lt;code&gt;p2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To do that, we need to implement logic for truncation and rounding.&lt;/p&gt;

&lt;p&gt;The logic for these operations can be found in the implementation of the &lt;a href="https://github.com/trinodb/trino/blob/eef66628759d7244c176f62be45f3d9f0e5a1a5d/core/trino-spi/src/main/java/io/trino/spi/type/Timestamps.java" rel="noopener noreferrer"&gt;Timestamps&lt;/a&gt; class.&lt;/p&gt;

&lt;p&gt;That class implements, among other things, the logic for truncating micros to millis and for rounding micros to millis.&lt;/p&gt;
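&lt;p&gt;The core idea behind those two operations can be sketched as follows (a minimal approximation of the logic, not the actual methods from the Timestamps class):&lt;/p&gt;

```java
// Minimal sketch of precision reduction (assumed logic, not Trino's exact
// methods): casting microseconds (p=6) down to milliseconds (p=3).
public class PrecisionCast {
    // Truncation simply drops the extra fractional digits.
    static long truncateMicrosToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros, 1000);
    }

    // Rounding adds half a millisecond before flooring, so .5 rounds up.
    static long roundMicrosToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros + 500, 1000);
    }

    public static void main(String[] args) {
        System.out.println(truncateMicrosToMillis(1_999_999)); // prints 1999
        System.out.println(roundMicrosToMillis(1_999_999));    // prints 2000
    }
}
```

&lt;p&gt;Negative epoch values (times before 1970) are why &lt;code&gt;Math.floorDiv&lt;/code&gt; is used here instead of plain integer division, which truncates toward zero.&lt;/p&gt;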

&lt;h2&gt;
  
  
  Time is Relative
&lt;/h2&gt;

&lt;p&gt;Trino is a federated query engine, which means that a query executed on it can be pushed down to different underlying systems.&lt;/p&gt;

&lt;p&gt;These systems usually do not share the same semantics. This is especially true for date/time types.&lt;/p&gt;

&lt;p&gt;For Trino to consider a new feature for connectors, it first has to be implemented and released for Hive.&lt;/p&gt;

&lt;p&gt;Hive is important not only because it is heavily used together with Trino, but also because it acts as the reference implementation for any other connector.&lt;/p&gt;

&lt;p&gt;The first step was to add support for &lt;a href="https://github.com/trinodb/trino/issues/3977" rel="noopener noreferrer"&gt;variable-precision timestamp&lt;/a&gt; type for Hive connector. You might notice on this issue that the Hive metastore supports timestamps with optional &lt;a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-TimestampstimestampTimestamps" rel="noopener noreferrer"&gt;nanosecond precision&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What that means is that although Trino can handle up to picosecond time resolution, when working with Hive, we can only use up to nanosecond.&lt;/p&gt;

&lt;p&gt;This is a common pattern, and different datastores will have different restrictions; for example, the Postgres connector can handle up to microsecond resolution.&lt;/p&gt;

&lt;p&gt;That means that the person who implements support for the new Timestamp parameters for the Postgres connector will have to account for this and ensure that the right type casts happen.&lt;/p&gt;
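&lt;p&gt;Conceptually, such a connector has to clamp incoming values to its maximum precision, along these lines (a hypothetical helper of my own, not the actual connector code):&lt;/p&gt;

```java
// Hypothetical illustration of per-connector precision clamping: a connector
// that only supports microseconds (like Postgres) must round away the extra
// fractional digits of a higher-precision timestamp.
public class ConnectorPrecision {
    // Round a picosecond-precision fraction (p=12) to a coarser target precision.
    static long clampPicos(long picos, int targetPrecision) {
        long divisor = 1;
        for (int i = targetPrecision; i != 12; i++) {
            divisor = divisor * 10;              // builds 10^(12 - targetPrecision)
        }
        return Math.floorDiv(picos + divisor / 2, divisor) * divisor;
    }

    public static void main(String[] args) {
        // Round 123_456_789_999 picoseconds to microsecond (p=6) resolution.
        System.out.println(clampPicos(123_456_789_999L, 6)); // prints 123457000000
    }
}
```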

&lt;p&gt;After Hive was supported, a number of other connectors and clients followed, including updates to the CLI, the JDBC driver, and the Python client. These are some of the most frequently used connectors and clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are always trade-offs when working with temporal data; in most cases, you will have to change the way you work with time multiple times while modeling and analyzing your data.&lt;/p&gt;

&lt;p&gt;The responsibility of the query engine is to provide you with all the tools you need to do that while ensuring the soundness and correctness of data processing.&lt;/p&gt;

&lt;p&gt;Now that you have a basic understanding of how types work in Trino, I encourage you to take a deeper look into the codebase and dig deeper into how these types are serialized and then processed.&lt;/p&gt;

</description>
      <category>portfolio</category>
      <category>howto</category>
      <category>tooling</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
