<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Deb</title>
    <description>The latest articles on Forem by Deb (@debadyuti).</description>
    <link>https://forem.com/debadyuti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1208801%2F825a19d1-1f09-4ba2-b6d5-00e1284d0521.png</url>
      <title>Forem: Deb</title>
      <link>https://forem.com/debadyuti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/debadyuti"/>
    <language>en</language>
    <item>
      <title>Streaming SQL in Stateful DataFlows</title>
      <dc:creator>Deb</dc:creator>
      <pubDate>Sat, 22 Feb 2025 22:37:35 +0000</pubDate>
      <link>https://forem.com/debadyuti/streaming-sql-in-stateful-dataflows-3jng</link>
      <guid>https://forem.com/debadyuti/streaming-sql-in-stateful-dataflows-3jng</guid>
      <description>&lt;h2&gt;
  
  
  Streaming SQL Functionality
&lt;/h2&gt;

&lt;p&gt;SQL streaming queries and stream processing operations are released in Stateful DataFlow Beta 7, running on Fluvio 0.15.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With SQL Streaming on Stateful DataFlow you can:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run ad-hoc queries on saved state objects and materialized views based on live event streams.&lt;/li&gt;
&lt;li&gt;Use SQL queries to run stream processing operations in data flows.&lt;/li&gt;
&lt;/ul&gt;
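&lt;p&gt;As a rough analogy (not SDF's actual engine), the kind of ad-hoc aggregation you would run over a saved state object can be sketched with an in-memory SQL table; the table name and data here are made up:&lt;/p&gt;

```python
import sqlite3

# Hypothetical state object: per-route event counts materialized from a stream.
# This is an illustrative analogy using sqlite3, not the SDF query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE route_counts (route TEXT, events INTEGER)")
conn.executemany(
    "INSERT INTO route_counts VALUES (?, ?)",
    [("A", 120), ("B", 45), ("A", 30)],
)

# Ad-hoc query over the saved state, just like a regular SQL table.
rows = conn.execute(
    "SELECT route, SUM(events) FROM route_counts GROUP BY route ORDER BY route"
).fetchall()
print(rows)  # [('A', 150), ('B', 45)]
```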

&lt;p&gt;For those not yet familiar with Fluvio or Stateful DataFlow:&lt;/p&gt;

&lt;p&gt;Fluvio - an open source distributed streaming engine written in Rust.&lt;br&gt;
Git Repo - &lt;a href="https://github.com/infinyon/fluvio" rel="noopener noreferrer"&gt;https://github.com/infinyon/fluvio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stateful DataFlow - a stream processing layer built on Fluvio using the wasm component model.&lt;/p&gt;

&lt;p&gt;Example projects to try on your machine - &lt;a href="https://github.com/infinyon/stateful-dataflows-examples/" rel="noopener noreferrer"&gt;https://github.com/infinyon/stateful-dataflows-examples/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL: From Static Tables to Streaming Data
&lt;/h2&gt;

&lt;p&gt;Remember when SQL was the only way to talk to your data? It wasn't just a query language - it was &lt;em&gt;the&lt;/em&gt; query language. But its story goes deeper than syntax.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Universal Language of Data
&lt;/h3&gt;

&lt;p&gt;Just as merchants in medieval Mediterranean ports needed a shared language to trade (that's where "lingua franca" came from), the tech world needed SQL to make data accessible across different systems and teams.&lt;/p&gt;

&lt;p&gt;If you're in a room with a DBA, a data analyst, and a business analyst, what's the one language they all speak? Likely SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'EMEA'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look familiar? Whether you're running Oracle, Postgres, or MySQL, this just works. Well, sort of!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why SQL Still Matters
&lt;/h3&gt;

&lt;p&gt;Three key factors have helped SQL stand the test of time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It's Human-Friendly&lt;/strong&gt;&lt;br&gt;
Instead of telling machines HOW to get data, you just say WHAT you want. &lt;code&gt;SELECT * FROM users WHERE status = 'active'&lt;/code&gt; reads almost like English.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It's Everywhere&lt;/strong&gt;&lt;br&gt;
From startups to Fortune 500s, SQL skills travel. Write once, run anywhere - from healthcare to fintech.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It Just Works&lt;/strong&gt;&lt;br&gt;
Need to analyze sales data? Track user behavior? SQL's got you covered, backed by decades of tooling and optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Real-Time Challenge
&lt;/h3&gt;

&lt;p&gt;In a world of Artificial Intelligence, Web3, and global markets, event streaming is no longer a luxury - it's a basic need. Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is your application combining data from multiple sources in real-time?&lt;/li&gt;
&lt;li&gt;Are your customers happy with stale insights?&lt;/li&gt;
&lt;li&gt;Do you need fresh data on demand?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bridging Static and Streaming
&lt;/h3&gt;

&lt;p&gt;What if you could use familiar SQL syntax for real-time data processing? What if your team could leverage their existing SQL skills for stream processing?&lt;/p&gt;

&lt;p&gt;We've been exploring these questions and implementing solutions that bring SQL's simplicity to streaming data. Want to see how? Check out the full article where we dive into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practical examples using NY Transit data&lt;/li&gt;
&lt;li&gt;Real-world streaming SQL queries in action&lt;/li&gt;
&lt;li&gt;How to implement stream processing without learning a new language&lt;/li&gt;
&lt;/ul&gt;
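&lt;p&gt;The core idea behind those streaming queries - aggregating an unbounded event stream over time windows - can be sketched in a few lines. This is illustrative Python, not SDF's implementation, and the event shape is made up:&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Group (timestamp, key) events into fixed, non-overlapping time windows.

    Sketch of tumbling-window aggregation: the kind of operation a streaming
    SQL GROUP BY over a time window performs on an event stream.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)

# Three hypothetical transit events: two in the first 60s window, one in the next.
events = [(10, "route_7"), (45, "route_7"), (70, "route_7")]
print(tumbling_window_counts(events, 60))
# {(0, 'route_7'): 2, (60, 'route_7'): 1}
```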

&lt;p&gt;Longer article on &lt;a href="https://infinyon.com/blog/2025/02/streaming-sql/" rel="noopener noreferrer"&gt;InfinyOn blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>rust</category>
      <category>distributedsystems</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>5 Learnings from sharing Kafka vs Fluvio Benchmarks on Reddit</title>
      <dc:creator>Deb</dc:creator>
      <pubDate>Fri, 14 Feb 2025 00:22:53 +0000</pubDate>
      <link>https://forem.com/debadyuti/5-learnings-from-sharing-kafka-vs-fluvio-benchmarks-on-reddit-34</link>
      <guid>https://forem.com/debadyuti/5-learnings-from-sharing-kafka-vs-fluvio-benchmarks-on-reddit-34</guid>
      <description>&lt;p&gt;To have a readable blog, all the links are at the bottom except the link to the Fluvio project.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benchmarking Fluvio: Community Insights and the Future of Streaming
&lt;/h1&gt;

&lt;p&gt;Yesterday, I shared a blog on benchmarking results comparing Fluvio, our next-generation streaming engine, with Apache Kafka. &lt;/p&gt;

&lt;p&gt;The response from the Rust community was encouraging, with over 30,000 impressions, 80+ upvotes, and 40+ comments in just 24 hours. The feedback was invaluable, and I want to share the 5 things I learned from all the developer feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Fluvio?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/infinyon/fluvio" rel="noopener noreferrer"&gt;Fluvio&lt;/a&gt; is a distributed streaming engine built in Rust over the past six years. While it follows Apache Kafka's conceptual patterns, it introduces programmable design patterns through Rust and WebAssembly-based stream processing called Stateful DataFlow (SDF). This makes Fluvio a complete platform for event streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Community Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Developers care a lot about the benchmark environment.
&lt;/h3&gt;

&lt;p&gt;The community emphasized the importance of comprehensive testing environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need for bare metal servers to eliminate virtualization artifacts&lt;/li&gt;
&lt;li&gt;Production-grade setups with proper replication (factor of 3)&lt;/li&gt;
&lt;li&gt;Large-scale validation with terabyte-scale live data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal benchmark would use real-world data from blockchain, high-frequency trading, or ad-tech workloads on bare metal servers, and compare multiple systems such as Kafka, Redpanda, and Pulsar.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Experienced developers know the trade-offs of different hardware.
&lt;/h3&gt;

&lt;p&gt;Developers highlighted several hardware-specific considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ARM Graviton chips' latency variations in virtualized environments&lt;/li&gt;
&lt;li&gt;Importance of testing across different CPU architectures including x86&lt;/li&gt;
&lt;li&gt;Thermal throttling differences between consumer laptops and server-grade hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Seasoned developers want production-ready configurations for each system being benchmarked
&lt;/h3&gt;

&lt;p&gt;Runtime mechanics need to reflect real-world scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific JVM and Garbage Collector configurations for Kafka benchmarking&lt;/li&gt;
&lt;li&gt;Resource utilization patterns under various loads&lt;/li&gt;
&lt;li&gt;Multi-node deployment testing at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. While benchmarks are great, benchmarking in a mature category requires addressing table-stakes features
&lt;/h3&gt;

&lt;p&gt;Key functionality developers look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer groups for ordered delivery per partition&lt;/li&gt;
&lt;li&gt;Stream and batch processing capabilities&lt;/li&gt;
&lt;li&gt;Robust delivery guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Benchmarks also immediately make developers think about reliability and the debugging experience
&lt;/h3&gt;

&lt;p&gt;Critical operational features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead letter queue implementations&lt;/li&gt;
&lt;li&gt;Retry strategies for network issues&lt;/li&gt;
&lt;li&gt;Delivery proof mechanisms beyond best-effort&lt;/li&gt;
&lt;/ul&gt;
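&lt;p&gt;To make the retry point concrete, here is a minimal exponential-backoff sketch. It is illustrative only - the function name and parameters are made up, and production systems typically add jitter and dead-letter routing on exhaustion:&lt;/p&gt;

```python
def backoff_delays(base=0.5, factor=2, max_retries=5, cap=8.0):
    """Exponential backoff schedule with a cap: the delays to wait between
    retries of a transient network failure before giving up (e.g. routing
    the record to a dead letter queue)."""
    return [min(base * factor ** attempt, cap) for attempt in range(max_retries)]

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```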

&lt;h2&gt;
  
  
  The New Streaming Paradigm
&lt;/h2&gt;

&lt;p&gt;Event streaming is a basic pattern in a world filled with agents.&lt;/p&gt;

&lt;p&gt;Wise developers focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practical performance over theoretical maxima&lt;/li&gt;
&lt;li&gt;Transparent benchmarking methodology&lt;/li&gt;
&lt;li&gt;Intuitive deployment and management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Our Vision for Next Generation Data Intensive Applications
&lt;/h2&gt;

&lt;p&gt;We believe the next wave of intelligent applications will come from builders who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Challenge traditional infrastructure assumptions&lt;/li&gt;
&lt;li&gt;Require millisecond latencies at scale&lt;/li&gt;
&lt;li&gt;Prioritize resource efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We don't just need faster systems - we need smarter ones that don't drain budgets or sanity.&lt;/p&gt;

&lt;p&gt;The future belongs to systems that balance raw performance with operational wisdom. The question isn't just about speed - it's about enabling rapid innovation and intuitive developer ergonomics while maintaining efficiency and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Involved
&lt;/h2&gt;

&lt;p&gt;Explore Fluvio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Star our &lt;a href="https://github.com/infinyon/fluvio" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Share with your friends&lt;/li&gt;
&lt;li&gt;Try our Rust-core development&lt;/li&gt;
&lt;li&gt;Experiment with our WASM based programmability&lt;/li&gt;
&lt;li&gt;Check out &lt;a href="https://infinyon.cloud" rel="noopener noreferrer"&gt;InfinyOn Cloud&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read More:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infinyon.com/blog/2025/02/kafka-vs-fluvio-bench/" rel="noopener noreferrer"&gt;Detailed Benchmark Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/r/rust/comments/1invl4e/apache_kafka_vs_fluvio_benchmarks/" rel="noopener noreferrer"&gt;Reddit Discussion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rust</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
      <category>bench</category>
    </item>
    <item>
      <title>How to build self hosted Github Stargazer Bots on Discord and Slack using Fluvio.</title>
      <dc:creator>Deb</dc:creator>
      <pubDate>Sat, 27 Jan 2024 05:56:51 +0000</pubDate>
      <link>https://forem.com/debadyuti/how-to-use-fluvio-to-build-self-hosted-github-stargazer-bots-on-discord-and-slack-38f5</link>
      <guid>https://forem.com/debadyuti/how-to-use-fluvio-to-build-self-hosted-github-stargazer-bots-on-discord-and-slack-38f5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This is my first article of 2024. Happy New Year to all of you.&lt;/p&gt;

&lt;p&gt;I have to begin with a confession. I planned to put this article out more than a month ago as a Christmas present, but I was spread too thin, wearing multiple hats.&lt;/p&gt;

&lt;p&gt;While this did not end up as the holiday present that I wanted to share with the open source community, I hope that it is a welcome present in early 2024.&lt;/p&gt;

&lt;p&gt;Github activity is near and dear to open source project owners and power users. It is motivating for me to see continuous streams of stars and forks activity. 🤩&lt;/p&gt;

&lt;p&gt;In this blog, I will share my workflow to build a simple streaming data pipeline using Fluvio to collect and aggregate Star 🌟 and Fork 🎏 activity from GitHub, and build bots on Slack and/or Discord to send automated updates on Star and Fork changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Navigation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  About Fluvio
&lt;/h3&gt;

&lt;p&gt;Fluvio Open Source&lt;br&gt;
Configurations&lt;br&gt;
Pick the workflow that is relevant to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Shortcut:&lt;/em&gt;&lt;/strong&gt; All the files, configs, and the commands in the blog are in this &lt;a href="https://github.com/drc-infinyon/fluvio-experiments/tree/main/github-stargazer-local"&gt;Git Repository&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Fluvio Local Setup - Self Hosted&lt;/li&gt;
&lt;li&gt;API Keys and Secrets&lt;/li&gt;
&lt;li&gt;Inbound GitHub Connector&lt;/li&gt;
&lt;li&gt;Outbound Discord Connector&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outbound Slack Connector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fluvio Docs Reference&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  Fluvio Open Source
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/infinyon/fluvio"&gt;Fluvio&lt;/a&gt; is a data streaming system written in Rust and WebAssembly. We have been building Fluvio as a cloud native distributed streaming system from the ground up for nearly 5 years. We are building ferociously to release stateful stream processing and time-window-based materializations, with support for several WebAssembly-compatible languages, soon!&lt;/p&gt;

&lt;p&gt;We have been tracking various GitHub activities including Stars and Forks on our &lt;a href="https://github.com/infinyon/fluvio"&gt;Fluvio Open Source&lt;/a&gt; repository using Fluvio. We have channels on our company Slack and community Discord where we get continuous and real-time alerts on Star 🌟 and Fork 🎏 activity.&lt;/p&gt;

&lt;p&gt;Of course you can get a lot more than stars and forks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you would like a full blown GitHub Dashboard that you can implement, ask for it in the comments and I will open source it if folks want it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Following this blog, you will have your very own streaming data pipeline to get Star 🌟 and Fork 🎏 activity on your own Slack or Discord channel. Let's go!&lt;/p&gt;
&lt;h3&gt;
  
  
  Configurations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fluvio Cluster Deployment:&lt;/strong&gt; This tutorial will show you how to self host everything locally. In another tutorial, I will share this workflow using our managed cloud if you'd like to try that flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub API:&lt;/strong&gt; The inbound data comes from the GitHub API and you will need your GitHub API Key to get higher query limits.&lt;/p&gt;

&lt;p&gt;As per GitHub docs, GitHub Apps authenticating with an installation access token use the installation's minimum rate limit of 5,000 requests per hour. If the installation is on a GitHub Enterprise Cloud organization, the installation has a rate limit of 15,000 requests per hour.&lt;/p&gt;

&lt;p&gt;If there is no access token then you would be limited to 60 requests per hour. The default config of this blog will assume a slower frequency to work without the access token.&lt;/p&gt;
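&lt;p&gt;Those rate limits translate directly into a minimum safe polling interval - simple arithmetic, not an official GitHub formula:&lt;/p&gt;

```python
def min_interval_seconds(requests_per_hour):
    """Smallest polling interval that stays under an hourly rate limit."""
    return 3600 / requests_per_hour

print(min_interval_seconds(60))    # 60.0 -> unauthenticated: poll at most once a minute
print(min_interval_seconds(5000))  # 0.72 -> with an installation access token
```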

&lt;p&gt;&lt;strong&gt;SlackBot / DiscordBot:&lt;/strong&gt; In terms of the bots sending you updates, you can have them run on Discord or Slack or both. For the bots there is a pretty simple workflow to create applications to interact with the Slack and Discord APIs. I will share the relevant workflow in this blog and provide references for the official docs at the bottom of the post.&lt;/p&gt;


&lt;h3&gt;
  
  
  Fluvio Local Setup - Self Hosted
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Install and launch Fluvio
&lt;/h4&gt;

&lt;p&gt;Fluvio installation is managed by Fluvio Version Manager shortened to &lt;em&gt;fvm&lt;/em&gt;. To install Fluvio run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://hub.infinyon.cloud/install/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy and run the last line of the install log on the terminal to add the install directory to the PATH variable.&lt;/p&gt;

&lt;h4&gt;
  
  
  Start a local Fluvio cluster
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fluvio cluster start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the cluster is running you will need to download the connectors and the smart modules. I like to organize them in a single working directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;github-stargazer-local &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$_&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create a free account on InfinyOn Cloud
&lt;/h4&gt;

&lt;p&gt;InfinyOn Cloud Hub is a repository of pre-built connectors, smartmodules and other workflow components.&lt;/p&gt;

&lt;p&gt;Create a free account using the &lt;a href="https://infinyon.cloud"&gt;InfinyOn Cloud sign-up page&lt;/a&gt; to access the &lt;a href="https://infinyon.cloud/hub"&gt;InfinyOn Hub&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Download connectors
&lt;/h4&gt;

&lt;p&gt;We have a full-blown development kit for building a connection or integration to practically any custom data source or sink. For this workflow we will use a couple of prebuilt connectors to accomplish our task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search available connectors:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk hub list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;You should see output like this:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  CONNECTOR                          Visibility
  infinyon-labs/graphite-sink@0.1.2  public
  infinyon/duckdb-sink@0.1.0         public
  infinyon/http-sink@0.2.6           public
  infinyon/http-source@0.3.1         public
  infinyon/ic-webhook-source@0.1.2   public
  infinyon/kafka-sink@0.2.7          public
  infinyon/kafka-source@0.2.5        public
  infinyon/mqtt-source@0.2.5         public
  infinyon/sql-sink@0.3.3            public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download the http source and sink connectors:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;http source&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk hub download infinyon/http-source@0.3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;http sink&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk hub download infinyon/http-sink@0.2.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Download smart modules
&lt;/h4&gt;

&lt;p&gt;Smart Modules are WebAssembly-based data transformers. As with connectors, we have a full-blown development kit to build custom data transformation logic. In this case we will use a couple of prebuilt smart modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search available smart modules:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fluvio hub sm list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;You should see output like this:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  SMARTMODULE                              Visibility
  infinyon-labs/array-map-json@0.1.0       public
  infinyon-labs/dedup-filter@0.0.2         public
  infinyon-labs/json-formatter@0.1.0       public
  infinyon-labs/key-gen-json@0.1.0         public
  infinyon-labs/regex-map-json@0.1.1       public
  infinyon-labs/regex-map@0.1.0            public
  infinyon-labs/rss-json@0.1.0             public
  infinyon-labs/stars-forks-changes@0.1.2  public
  infinyon/jolt@0.3.0                      public
  infinyon/json-sql@0.2.1                  public
  infinyon/regex-filter@0.1.0              public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download the stars-forks-changes and jolt smart modules:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;stars-forks-changes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fluvio hub sm download infinyon-labs/stars-forks-changes@0.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;jolt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fluvio hub sm download infinyon/jolt@0.3.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's all you will need for setup. Topics will be created automatically by the connectors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to remove the Fluvio cluster at any point, shut down the connectors (commands in the sections below), then run&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fluvio cluster delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back to Navigation&lt;/p&gt;




&lt;h3&gt;
  
  
  Apps and API Keys
&lt;/h3&gt;

&lt;p&gt;We will store the API Keys and secrets in a file for self hosted deployment.&lt;/p&gt;

&lt;p&gt;Create a file named &lt;strong&gt;secrets.txt&lt;/strong&gt; to add the relevant API Keys once we create them.&lt;/p&gt;

&lt;p&gt;Create the file using your favourite text editor and add the following variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DISCORD_TOKEN=
SLACK_TOKEN=
GITHUB_TOKEN=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back to Navigation&lt;/p&gt;




&lt;h3&gt;
  
  
  Inbound GitHub Connector
&lt;/h3&gt;

&lt;p&gt;To connect to the GitHub API, you can create an API key based on the documentation here: &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens"&gt;https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a file and call it &lt;strong&gt;github.yaml&lt;/strong&gt; with the following configuration pattern:&lt;br&gt;
You need to put in the API endpoint of your repository and adjust the interval. With an API key you can poll as often as every 1s; without one, stay at 60s intervals or slower. The configuration below uses 60s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;
&lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.3.1&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github-stars-inbound&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http-source&lt;/span&gt;
  &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github-stars&lt;/span&gt;
&lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/repos/[your_org/your_repo]'&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
&lt;span class="na"&gt;transforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infinyon/jolt@0.3.0&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shift&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stargazers_count"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stars"&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forks_count"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forks"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
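&lt;p&gt;The jolt &lt;code&gt;shift&lt;/code&gt; transform above renames fields in each JSON record from the GitHub API response. In plain Python the same reshaping looks like this - an analogy for what the spec does, not the jolt engine itself:&lt;/p&gt;

```python
import json

def shift(record):
    """Mirror of the jolt 'shift' spec in github.yaml:
    stargazers_count -> stars, forks_count -> forks; other fields are dropped."""
    return {"stars": record["stargazers_count"], "forks": record["forks_count"]}

# Trimmed-down example of a GitHub repo API response.
raw = json.loads('{"stargazers_count": 1234, "forks_count": 56, "name": "fluvio"}')
print(shift(raw))  # {'stars': 1234, 'forks': 56}
```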



&lt;p&gt;Run the http-source connector using the configuration with the following command:&lt;/p&gt;

&lt;p&gt;You can skip the secrets parameter if you have not set a GitHub API Key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy start &lt;span class="nt"&gt;--ipkg&lt;/span&gt; infinyon-http-source-0.3.1.ipkg &lt;span class="nt"&gt;--config&lt;/span&gt; github.yaml &lt;span class="nt"&gt;--secrets&lt;/span&gt; secrets.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All configurations in the context of Fluvio data flows are YAML-based. We will have a configuration for each source and sink system to deploy the connectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect or Shutdown the GitHub Connector:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to see the status, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to shut down the connector, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy shutdown &lt;span class="nt"&gt;--name&lt;/span&gt; github-stars-inbound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back to Navigation&lt;/p&gt;




&lt;h3&gt;
  
  
  Outbound Discord Connector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;To get the alerts for stars and forks in a Discord channel you need:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Discord server with admin access&lt;/li&gt;
&lt;li&gt;a new or existing Discord Channel for the alerts&lt;/li&gt;
&lt;li&gt;a Discord Application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Create a Discord Application:&lt;/strong&gt; To create the Discord bot, go into Server Settings -&amp;gt; Integrations -&amp;gt; New Webhook, name the webhook, pick the channel, and copy the Webhook URL.&lt;/p&gt;

&lt;p&gt;The Discord token here is the unique identifier portion of your webhook URL.&lt;/p&gt;
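&lt;p&gt;If you want to sanity-check the webhook before wiring up the connector, the request the http-sink ultimately makes is just a JSON POST. Here is a sketch using Python's standard library; the token value is a placeholder, and Discord webhooks expect the message text under a &lt;code&gt;content&lt;/code&gt; field:&lt;/p&gt;

```python
import json
import urllib.request

def build_webhook_request(token, text):
    """Build the POST a webhook delivery performs: a JSON body against the
    Discord webhook endpoint. Discord expects the message under 'content'."""
    url = f"https://discord.com/api/webhooks/{token}"
    body = json.dumps({"content": text}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_webhook_request("YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN", "Star count changed!")
print(req.full_url)
# Uncomment to actually send (requires a valid webhook token):
# urllib.request.urlopen(req)
```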

&lt;p&gt;&lt;strong&gt;Create a configuration file to connect to Discord - call it discord.yaml:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;
&lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2.6&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;discord-stars-outbound&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http-sink&lt;/span&gt;
  &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github-stars&lt;/span&gt;
  &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DISCORD_TOKEN&lt;/span&gt;
&lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://discord.com/api/webhooks/${{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;secrets.DISCORD_TOKEN&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;application/json"&lt;/span&gt;
&lt;span class="na"&gt;transforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infinyon-labs/stars-forks-changes@0.1.2&lt;/span&gt;
    &lt;span class="na"&gt;lookback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;last&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infinyon/jolt@0.3.0&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shift&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
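
&lt;p&gt;To make the transform chain concrete, here is a minimal Python sketch of the jolt &lt;code&gt;shift&lt;/code&gt; step above. The record shape is an assumption for illustration: it presumes the upstream &lt;code&gt;stars-forks-changes&lt;/code&gt; step emits a JSON object with a &lt;code&gt;result&lt;/code&gt; field, which the shift renames to &lt;code&gt;text&lt;/code&gt; — the field Discord webhooks expect:&lt;/p&gt;

```python
import json

def jolt_shift_result_to_text(record: dict) -> dict:
    # Mirrors the shift spec above: {"result": <message>} -> {"text": <message>}.
    return {"text": record["result"]}

# Hypothetical record from the stars-forks-changes transform (shape assumed).
event = json.loads('{"result": "infinyon/fluvio stars: 3100, forks: 420"}')
payload = jolt_shift_result_to_text(event)
print(json.dumps(payload))  # {"text": "infinyon/fluvio stars: 3100, forks: 420"}
```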



&lt;p&gt;&lt;strong&gt;Start the Discord Connector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy start &lt;span class="nt"&gt;--ipkg&lt;/span&gt; infinyon-http-sink-0.2.6.ipkg &lt;span class="nt"&gt;--config&lt;/span&gt; discord.yaml &lt;span class="nt"&gt;--secrets&lt;/span&gt; secrets.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
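
&lt;p&gt;The &lt;code&gt;--secrets&lt;/code&gt; flag points at a plain-text file of &lt;code&gt;NAME=value&lt;/code&gt; pairs. A sketch of what &lt;code&gt;secrets.txt&lt;/code&gt; might look like (the value here is a placeholder, not a real webhook path):&lt;/p&gt;

```
DISCORD_TOKEN=123456789012345678/AbCdEf-placeholder-webhook-token
```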



&lt;p&gt;You will now receive notifications on Stars and Forks activity in the Discord channel that you chose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect or Shutdown the Discord Connector:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to see the status, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to shut down the connector, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy shutdown &lt;span class="nt"&gt;--name&lt;/span&gt; discord-stars-outbound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h3&gt;
  
  
  Outbound Slack Connector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;To get the alerts for stars and forks in a Slack channel you need:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Slack workspace with admin access&lt;/li&gt;
&lt;li&gt;a new or existing Slack channel for the alerts&lt;/li&gt;
&lt;li&gt;a Slack application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Create a Slack Application:&lt;/strong&gt; Follow the steps in the &lt;a href="https://api.slack.com/start/quickstart"&gt;quickstart to create a Slack application&lt;/a&gt;: activate Incoming Webhooks under Features, copy the webhook URL, and install the Slack app in your workspace.&lt;/p&gt;

&lt;p&gt;The Slack token is the unique path segment of that webhook URL; it identifies your workspace and app.&lt;/p&gt;
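
&lt;p&gt;Under the hood, the connector substitutes &lt;code&gt;SLACK_TOKEN&lt;/code&gt; into the endpoint template and POSTs the JSON produced by the transforms. A rough Python equivalent of one such request (the token below is a placeholder, and the request is built but not sent):&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder webhook path; a real Slack webhook URL ends in T.../B.../XXXX.
SLACK_TOKEN = "T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

endpoint = f"https://hooks.slack.com/services/{SLACK_TOKEN}"
body = json.dumps({"text": "infinyon/fluvio stars: 3100, forks: 420"}).encode()

req = urllib.request.Request(
    endpoint,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)
# urllib.request.urlopen(req)  # uncomment to actually post a test message
```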

&lt;p&gt;&lt;strong&gt;Create a configuration file to connect to Slack - call it slack.yaml:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;
&lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2.6&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-stars-outbound&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http-sink&lt;/span&gt;
  &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github-stars&lt;/span&gt;
  &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SLACK_TOKEN&lt;/span&gt;
&lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/services/${{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;secrets.SLACK_TOKEN&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;application/json"&lt;/span&gt;
&lt;span class="na"&gt;transforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infinyon-labs/stars-forks-changes@0.1.2&lt;/span&gt;
    &lt;span class="na"&gt;lookback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;last&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infinyon/jolt@0.3.0&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shift&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Start the Slack Connector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy start &lt;span class="nt"&gt;--ipkg&lt;/span&gt; infinyon-http-sink-0.2.6.ipkg &lt;span class="nt"&gt;--config&lt;/span&gt; slack.yaml &lt;span class="nt"&gt;--secrets&lt;/span&gt; secrets.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will start receiving notifications on Stars and Forks activity in the Slack channel that you chose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect or Shutdown the Slack Connector:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to see the status, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to shut down the connector, you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy shutdown &lt;span class="nt"&gt;--name&lt;/span&gt; slack-stars-outbound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h3&gt;
  
  
  Fluvio Docs Reference
&lt;/h3&gt;

&lt;p&gt;Here are some relevant docs that you can look at for further context:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://infinyon.com/docs/"&gt;InfinyOn Docs Master&lt;/a&gt;&lt;br&gt;
&lt;a href="https://infinyon.com/docs/guides/github-to-slack/"&gt;GitHub to Slack&lt;/a&gt;&lt;br&gt;
&lt;a href="https://infinyon.com/docs/guides/github-to-discord/"&gt;GitHub to Discord&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.fluvio.io/cli/"&gt;Fluvio CLI Docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>tutorial</category>
      <category>rust</category>
      <category>github</category>
    </item>
    <item>
      <title>8 Rusty open source data projects to watch in 2024 🤩</title>
      <dc:creator>Deb</dc:creator>
      <pubDate>Thu, 07 Dec 2023 04:54:43 +0000</pubDate>
      <link>https://forem.com/debadyuti/8-rusty-open-source-data-projects-to-watch-in-2024-3o90</link>
      <guid>https://forem.com/debadyuti/8-rusty-open-source-data-projects-to-watch-in-2024-3o90</guid>
      <description>&lt;h3&gt;
  
  
  Context
&lt;/h3&gt;

&lt;p&gt;The open source ecosystem is my favourite aspect of tech, and it has been since the early days of big data in the late 2000s. For software, infra, and data engineers, open source projects are a great source of inspiration. They also help us build solutions that validate and resolve customer challenges without reinventing the wheel.&lt;/p&gt;

&lt;p&gt;In my first post wearing the developer advocacy hat as a technical product leader, I want to share 8 data projects that I am tinkering with during the December break going into 2024.&lt;/p&gt;

&lt;p&gt;First I will explain why data projects, and why Rust. Then I will share the 8 open source projects that I am looking into more deeply over the coming days and months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why data projects?
&lt;/h3&gt;

&lt;p&gt;Data is at the centre of attention in tech. For the past 12 years, I have heard it said at several software companies that "we want to treat data as a first class citizen." The reality behind that statement is that data has been anything but a first-class citizen in the majority of organizations.&lt;/p&gt;

&lt;p&gt;Data has come into the limelight every now and then, but the attention has been off and on.&lt;/p&gt;

&lt;p&gt;In my personal experience, data has received bursts of attention since the early 21st century, with the rise of the big data ecosystem, distributed file systems, map-reduce, NoSQL, machine learning, deep learning, artificial intelligence, and more recently large language models.&lt;/p&gt;

&lt;p&gt;I have been working with complex data use cases since 2009. While I got enthused by augmented and virtual reality for a while, and by blockchain and cryptography for a bit, my affinity for data has been consistent throughout my career.&lt;/p&gt;

&lt;p&gt;There has never been a better time to dive deep into data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Rust based?
&lt;/h3&gt;

&lt;p&gt;Since my first C/C++ programs implementing trees and linked lists in the early 2000s, I have written code in C#, ASP.NET, Java, Python, and R between 2010 and now. It has taken extra work to keep up with writing code since moving to technical product management in 2014.&lt;/p&gt;

&lt;p&gt;That's why I had not tried Go or Rust yet and had remained content with SQL and Python scripts. As someone with a bit of experience in data-centric applications, I would argue that the existing stack has served us well. But the demands of data-centric applications have grown along with the innovations in hardware infrastructure, and we need to adapt.&lt;/p&gt;

&lt;p&gt;Rust has been dubbed hype for a while. But it is boring, and it is complex to get started with. It sits at the system level like C++, with some additional promise.&lt;/p&gt;

&lt;p&gt;Yet it is the most loved language among its adopters, and it has proved better than alternative systems programming languages in terms of performance, security, reliability, and efficiency. Several big tech companies have made the incredibly rare choice of rewriting critical distributed systems in Rust.&lt;/p&gt;

&lt;p&gt;I have been working since January 2023 with a team that has been building with Rust for five years. As I dabble with various projects, I believe Rust will inevitably power several distributed data systems. It's about time I took a deeper dive!&lt;/p&gt;

&lt;h3&gt;
  
  
  The 8 Rust-based open source data projects to try in 2024
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Fluvio [GitHub: &lt;a href="https://github.com/infinyon/fluvio"&gt;https://github.com/infinyon/fluvio&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The word Fluvio comes from Latin fluvius ‘river’. A river is made up of continuous 'streams' and powers entire civilizations through agriculture. It is an apt name for a data streaming platform.&lt;/p&gt;

&lt;p&gt;I learnt about Fluvio in mid 2022, when a genius tech architect friend recommended the project. In January 2023 I left my previous, higher-paying role to work in product leadership at InfinyOn, the creators of Fluvio. I have mostly used the cloud instances, since I did not want to spend the additional time configuring Kubernetes on my local system to play around with the open source version.&lt;/p&gt;

&lt;p&gt;Here is the description that I recently updated on the Fluvio GitHub Repo:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lean and mean distributed stream processing system written in rust and web assembly.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;We just released a single binary installer for Fluvio Open Source with no dependency on Kubernetes, and I am now going to start tinkering with the dev kits in the open source project. I will be sharing more about those workflows soon.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Arroyo [GitHub: &lt;a href="https://github.com/ArroyoSystems/arroyo"&gt;https://github.com/ArroyoSystems/arroyo&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Arroyo is a scalable, stateful stream processing engine written in Rust that provides a SQL interface. Arroyo is a Y Combinator-backed startup project with a mission to build a streaming-first future.&lt;/p&gt;

&lt;p&gt;About Arroyo on the GitHub Repo is succinct:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Distributed stream processing engine in Rust&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ability to express stateful streaming operations in declarative SQL is something data engineers love. On top of that, Arroyo integrates with Fluvio, and since both engines are Rust-based, they make a great combination.&lt;/p&gt;

&lt;p&gt;I am looking forward to trying out the Arroyo example projects and running SQL-based state operations on data from Fluvio topics. I will be sharing all about these workflows.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Qdrant [GitHub: &lt;a href="https://github.com/qdrant/qdrant"&gt;https://github.com/qdrant/qdrant&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Qdrant is a vector database and similarity search engine written in Rust. It's making waves powering better search capabilities for several well-known products like Twitter, GitBook, and more.&lt;/p&gt;

&lt;p&gt;The GitHub repo About section describes Qdrant as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud &lt;a href="https://cloud.qdrant.io/"&gt;https://cloud.qdrant.io/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ability to store embeddings from large language models and deep neural network-based autoencoders is in high demand, given the massive accessibility and popularity of large language models.&lt;/p&gt;

&lt;p&gt;Search and chat are heavily used functionalities, and I am looking forward to building out the demo projects and digging into the interfaces in Qdrant.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Lancedb [GitHub: &lt;a href="https://github.com/lancedb/lancedb"&gt;https://github.com/lancedb/lancedb&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LanceDB calls itself a database for vector search, and it acts as a bridge between Python and Rust: the LanceDB core is written in Rust, and it supports native Python and JavaScript/TypeScript.&lt;/p&gt;

&lt;p&gt;The GitHub Repo About section describes LanceDB as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am looking forward to testing out the ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache-Arrow, Pandas, Polars, DuckDB and more on the way. It would be interesting to see the similarities and differences between LanceDB and Qdrant.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Linfa [GitHub: &lt;a href="https://github.com/rust-ml/linfa"&gt;https://github.com/rust-ml/linfa&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Having spent a decent amount of time over the years tinkering with Scikit-Learn in Python, I am curious to try out Linfa. It would be cool to try the basic machine learning and statistical models to get a feel for how to implement them in Rust using a reasonable framework.&lt;/p&gt;

&lt;p&gt;About Linfa on GitHub is precise:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A Rust machine learning framework.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Linfa in Italian translates to sap in English. The project "aims to provide a comprehensive toolkit to build Machine Learning applications with Rust. Kin in spirit to Python's scikit-learn, it focuses on common preprocessing tasks and classical ML algorithms for your everyday ML tasks."&lt;/p&gt;

&lt;p&gt;I hope this becomes a decent bridge from Python to Rust. Data processing for machine learning and statistical modelling has been a useful way for me to learn about different use cases, and I am looking forward to playing around with Linfa.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Polars [GitHub: &lt;a href="https://github.com/pola-rs/polars"&gt;https://github.com/pola-rs/polars&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Polars has the most momentum among data engineers who are used to Pandas DataFrames. I have a decent bit of experience experimenting with Pandas. The interesting thing is that Polars is a DataFrame interface on top of an OLAP engine.&lt;/p&gt;

&lt;p&gt;About Polars from GitHub reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dataframes powered by a multithreaded, vectorized query engine, written in Rust.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am specifically looking forward to trying out the hybrid streaming functionality in Polars and testing the possibilities of using the interface for interactive real-time visualization. I also want to test the PyO3 extensions for Python functions compiled in Rust.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;Lance [GitHub: &lt;a href="https://github.com/lancedb/lance"&gt;https://github.com/lancedb/lance&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Lance is a columnar data format that deliberately calls out the ease of converting from Parquet files. Lance is the format used by the LanceDB vector database, and it is also compatible with Pandas, DuckDB, Polars, and PyArrow.&lt;/p&gt;

&lt;p&gt;About Lance from GitHub reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Converting Parquet files with 2 lines of code for 100x faster random access is insane, given the popularity of the Parquet format. I am looking to try out a couple of end-to-end streaming flows, switch around the data formats, and compare the different engines in different parts of the pipeline to identify the integration challenges and the pros and cons of using different systems.&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;Surrealdb [GitHub: &lt;a href="https://github.com/surrealdb/surrealdb"&gt;https://github.com/surrealdb/surrealdb&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Surreal is the most popular project on this list in terms of community engagement. It is a document-graph data store for modern web applications.&lt;/p&gt;

&lt;p&gt;About SurrealDB from GitHub reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A scalable, distributed, collaborative, document-graph database, for the realtime web.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Surreal is interesting because it is transactional, strongly typed, and offers granular access control. It also offers a degree of multi-modality, with representations as tables, documents, and graphs, making it an interesting player among distributed serverless databases.&lt;/p&gt;

&lt;p&gt;I am looking forward to checking out the community projects and trying out their data models for streaming data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;It's always a party in tech, and with new things announced every day there will surely be more options in the coming year. A few other projects that I am looking to try along the way come to mind, including Apache OpenDAL [&lt;a href="https://opendal.apache.org/"&gt;https://opendal.apache.org/&lt;/a&gt;] and OneTable [&lt;a href="https://github.com/onetable-io/onetable"&gt;https://github.com/onetable-io/onetable&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Anyway, it sounds like 2024 is going to take me back to my software and data engineering days of the early 2010s, and I am looking forward to sharing my experiences with these tools, performance profiles, opinions about the landscape, and more.&lt;/p&gt;

&lt;p&gt;I am also a big believer in sharing my work where people find their information. I have a single link to my profiles on various developer communities and social media. Feel free to connect with me wherever you engage with developer insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is my link tree&lt;/strong&gt; - &lt;a href="https://www.singlel.ink/u/debroychowdhury"&gt;https://www.singlel.ink/u/debroychowdhury&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you around&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Deb.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>data</category>
      <category>opensource</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
