<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: avital trifsik</title>
    <description>The latest articles on Forem by avital trifsik (@atrifsik).</description>
    <link>https://forem.com/atrifsik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F871348%2Ff9858cf1-05b1-46c3-b1c1-116843470cf5.jpg</url>
      <title>Forem: avital trifsik</title>
      <link>https://forem.com/atrifsik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/atrifsik"/>
    <language>en</language>
    <item>
      <title>Introducing Memphis Functions</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 09 Nov 2023 14:14:39 +0000</pubDate>
      <link>https://forem.com/memphis_dev/introducing-memphis-functions-4b1f</link>
      <guid>https://forem.com/memphis_dev/introducing-memphis-functions-4b1f</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2grkgm6p8sg9xt47t57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2grkgm6p8sg9xt47t57.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The story
&lt;/h2&gt;

&lt;p&gt;Organizations are increasingly embracing real-time event processing, intercepting data streams before they enter the data warehouse, and embracing event-driven architectural paradigms. However, they must contend with the ever-evolving landscape of data and technology. Development teams face the challenge of maintaining alignment with these changes while also striving for greater development efficiency and agility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further challenges lie ahead:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Developing new stream processing flows is a formidable task. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code exhibits high coupling to particular flows or event types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is no opportunity for code reuse or sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debugging, troubleshooting, and rectifying issues pose ongoing challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Managing code evolution remains a persistent concern.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The shortcomings of current solutions are as follows:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They impose the use of SQL or other vendor-specific, lock-in languages on developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They lack support for custom logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They add complexity to the infrastructure, particularly as operations scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They do not facilitate code reusability or sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ultimately, they demand a significant amount of time to construct a real-time application or pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing Memphis Functions
&lt;/h2&gt;

&lt;p&gt;The Memphis platform is composed of four independent components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Memphis Broker, serving as the storage layer.&lt;/li&gt;
&lt;li&gt;Schemaverse, responsible for schema management.&lt;/li&gt;
&lt;li&gt;Memphis Functions, designed for serverless stream processing.&lt;/li&gt;
&lt;li&gt;Memphis Connectors, facilitating data retrieval and delivery through pre-built connectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Memphis Functions empower developers and data engineers with the ability to seamlessly process, transform, and enrich incoming events in real-time through a serverless paradigm, all within the familiar AWS Lambda syntax.&lt;/p&gt;

&lt;p&gt;This means they can achieve these operations without being burdened by boilerplate code, intricate orchestration, error-handling complexities, or the need to manage underlying infrastructure.&lt;/p&gt;

&lt;p&gt;Memphis Functions provide this versatility in an array of programming languages, including but not limited to Go, Python, JavaScript, .NET, Java, and SQL. This flexibility ensures that development teams have the freedom to select the language best suited to their specific needs, making the event processing experience more accessible and efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s more?
&lt;/h3&gt;

&lt;p&gt;In addition to orchestrating various functions, Memphis Functions offer a comprehensive suite for the end-to-end management and observability of these functions. This suite encompasses features such as a robust retry mechanism, dynamic auto-scaling utilizing both Kubernetes-based and established public cloud serverless technologies, extensive monitoring capabilities, dead-letter handling, efficient buffering, distributed security measures, and customizable notifications.&lt;/p&gt;

&lt;p&gt;It’s important to note that Memphis Functions are designed to seamlessly complement existing streaming platforms, such as Kafka, without imposing the necessity of adopting the Memphis broker. This flexibility allows organizations to leverage Memphis Functions while maintaining compatibility with their current infrastructures and preferences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Write your processing function&lt;/strong&gt;&lt;br&gt;
Utilize the same syntax as you would when crafting a function for AWS Lambda, taking advantage of the familiar and powerful AWS Lambda framework. This approach lets you tap into AWS Lambda’s extensive ecosystem and development resources, making serverless function creation seamless and efficient, without having to learn yet another framework’s syntax.&lt;/p&gt;

&lt;p&gt;Functions can range from a simple string-to-JSON conversion all the way to pushing a webhook based on an event’s payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Connect Memphis to your git repository&lt;/strong&gt;&lt;br&gt;
Integrating Memphis with your git repository is the next crucial step. By doing so, Memphis establishes an automated link to your codebase, effortlessly fetching the functions you’ve developed. These functions are then conveniently showcased within the Memphis Dashboard, streamlining the entire process of managing and monitoring your serverless workflows. This seamless connection simplifies collaboration, version control, and overall visibility into your stream processing application development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Attach functions to streams&lt;/strong&gt;&lt;br&gt;
Now it’s time to integrate your functions with the streams. By attaching your developed functions to the streams, you establish a dynamic pathway for ingested events. These events will seamlessly traverse through the connected functions, undergoing processing as specified in your serverless workflow. This crucial step ensures that the events are handled efficiently, allowing you to unleash the full potential of your processing application with agility and scalability.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Gain early access: sign up for the Memphis Functions private beta waiting list &lt;a href="https://www.functions.memphis.dev/" rel="noopener noreferrer"&gt;here&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej" rel="noopener noreferrer"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis-broker" rel="noopener noreferrer"&gt;Github&lt;/a&gt;•&lt;a href="https://docs.memphis.dev/memphis/getting-started/readme" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;•&lt;a href="https://discord.com/invite/DfWFT7fzUu" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Event-Driven Architecture with Serverless Functions – Part 1</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 09 Oct 2023 08:38:24 +0000</pubDate>
      <link>https://forem.com/memphis_dev/event-driven-architecture-with-serverless-functions-part-1-1ei3</link>
      <guid>https://forem.com/memphis_dev/event-driven-architecture-with-serverless-functions-part-1-1ei3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the 1st part of the series “A new type of stream processing״.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this series of articles, we are going to explain what the missing piece in stream processing is, and in this part, we’ll start from the source. We’ll break down the different components and walk through how they can be used in tandem to drive modern software.&lt;/p&gt;

&lt;p&gt;First things first: event-driven architecture. EDA and serverless functions are two powerful software patterns and concepts that have become popular in recent years with the rise of cloud-native computing. While one is more of an architectural pattern and the other a deployment or implementation detail, when combined, they provide a scalable and efficient solution for modern applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Event-Driven Architecture?
&lt;/h2&gt;

&lt;p&gt;EDA is a software architecture pattern that utilizes events to decouple various components of an application. In this context, an event is defined as a change in state. For example, for an e-commerce application, an event could be a customer clicking on a listing, adding that item to their shopping cart, or submitting their credit card information to buy. Events also encompass non-user-initiated state changes, such as scheduled jobs or notifications from a monitoring system.&lt;/p&gt;

&lt;p&gt;The primary goal of EDA is to create loosely coupled components or microservices that communicate by producing and consuming events between one another asynchronously. This way, different components of the system can scale up or down independently for availability and resilience. This decoupling also allows development teams to add or release new features more quickly and safely, as long as each component’s interface remains compatible. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Usual Components
&lt;/h2&gt;

&lt;p&gt;A scalable event-driven architecture will comprise three key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Producer&lt;/strong&gt;: components that publish or produce events. These can be frontend services that take in user input, edge devices like IoT systems, or other types of applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broker&lt;/strong&gt;: components that take in events from producers and deliver them to consumers. Examples include Kafka, Memphis.dev, or AWS SQS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumer&lt;/strong&gt;: components that listen to events and act on them. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s also important to note that some components may be a producer for one workflow while being a consumer for another. For example, a credit card processing service could be a consumer for events that involve credit cards, such as new purchases or credit card information updates. At the same time, this service may be a producer for downstream services that record purchase history or detect fraudulent activity. &lt;/p&gt;

&lt;h2&gt;
  
  
  Common Patterns
&lt;/h2&gt;

&lt;p&gt;Since EDA is a broad architectural pattern, it can be applied in many ways. Some common patterns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Point-to-point messaging&lt;/strong&gt;: For applications that need a simple one-to-one communication channel, a point-to-point messaging pattern may be used with a simple queue. Events are sent to a queue (messaging channels) and buffered for consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pub/sub&lt;/strong&gt;: If multiple consumers need to listen to the same events, pub/sub style messaging may be used. In this scenario, the producer generates events on a topic that consumers can subscribe to. This is useful for scenarios where events need to be broadcast (e.g. replication) or different business logic must be applied to the same event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communication models&lt;/strong&gt;: Different use cases dictate how communication should be coordinated. In some cases, it must be orchestrated via a centralized service if logic involves some distinct steps with dependencies. In other cases, it can be choreographed where producers can generate events without worrying about downstream dependencies as long as the events adhere to a predetermined schema or format. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  EDA Use Cases
&lt;/h2&gt;

&lt;p&gt;Event-driven architecture became much more popular with the rise of cloud-native applications and microservices. We are not always aware of it, but if we take Uber, Doordash, Netflix, Lyft, Instacart, and many more, each one of them is completely based on an event-driven, async architecture.&lt;/p&gt;

&lt;p&gt;Another key use case is data processing of events that require massive parallelization and elasticity to changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let’s talk about Serverless Functions
&lt;/h2&gt;

&lt;p&gt;Serverless functions are a subset of the serverless computing model, where a third party (typically a cloud or service provider) or some orchestration engine manages the infrastructure on behalf of the users and only charges on a per-use basis. In particular, serverless functions or Function-as-a-Service (FaaS) allow users to write small functions as opposed to full-fledged services with server logic and abstract away the typical “server” functionality such as HTTP listeners, scaling, and monitoring. To developers, serverless functions can simplify their workflow significantly as they can focus on the business logic and allow the service provider to bootstrap the infrastructure and server functionalities.&lt;/p&gt;

&lt;p&gt;Serverless functions are usually triggered by external events. This can be an HTTP call or events on a queue or a pub/sub-style messaging system. Serverless functions are generally stateless and designed to handle individual events. When there are multiple calls, the service provider will automatically scale up functions as needed, unless the user limits parallelism to one. While different implementations of serverless functions have varying execution limits, in general, serverless functions are meant to be short-lived and should not be used for long-running jobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  How about combining EDA and Serverless Functions?
&lt;/h2&gt;

&lt;p&gt;As you probably noticed, since serverless functions are triggered by events, they make a great pairing with event-driven architectures. This is especially true for stateless services that can be short-lived. A lot of microservices probably fall under this bucket, unless they are responsible for batch processing or heavy analytics that push the execution limits. &lt;/p&gt;

&lt;p&gt;The benefit of utilizing serverless functions with event-driven architecture is the reduced overhead of managing the underlying infrastructure, freeing up developers’ time to focus on business logic that drives business value. Also, since service providers only charge per use, it can be a cost-efficient alternative to running self-hosted infrastructure on servers, VMs, or containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sounds like the perfect match, right? Why isn’t it everywhere?&lt;/strong&gt;&lt;br&gt;
While AWS is doing its best to push us toward Lambda, and Cloudflare, GCP, and others invest hundreds of millions of dollars to convince developers to use their serverless frameworks, it still feels “harder” than building a traditional service in the context of EDA or data processing.&lt;/p&gt;

&lt;p&gt;Among the reasons are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lack of observability. What went in and what went out?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Debugging is still hard. When dealing with data that will only arrive in the future, a great debugging experience is a must, as the slightest change in the data’s structure will break the function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lack of a retry mechanism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Difficulty converting a batch of messages into a processing batch.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea of combining both technologies is very interesting. Ultimately, it can save a great number of dev hours and effort, and add capabilities that are extremely hard to develop with traditional methodologies, but the reasons mentioned above are definitely a major blocker.&lt;/p&gt;

&lt;p&gt;I will end part 1 with a simple, open question/teaser –&lt;br&gt;
Have you ever used Zapier?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Stay tuned&lt;/a&gt; for part 2 and get one step closer to learning more about the new way to do stream processing.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>dataengineering</category>
      <category>datastructures</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Task scheduling with a message broker</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:38:45 +0000</pubDate>
      <link>https://forem.com/memphis_dev/task-scheduling-with-a-message-broker-2lj5</link>
      <guid>https://forem.com/memphis_dev/task-scheduling-with-a-message-broker-2lj5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Task scheduling is essential in modern applications to maximize resource utilization and user experience (non-blocking task fulfillment).&lt;br&gt;
A queue is a powerful tool that allows your application to manage and prioritize tasks in a structured, persistent, and scalable way.&lt;br&gt;
While there are multiple possible solutions, working with a queue (which is also the perfect data structure for this type of work) can ensure that tasks are completed in their creation order without the risk of forgetting, overlooking, or double-processing critical tasks.&lt;/p&gt;

&lt;p&gt;An interesting story about this need, and how it evolves as scale grows, can be found in a blog post by one of DigitalOcean’s co-founders:&lt;br&gt;
&lt;a href="https://www.digitalocean.com/blog/from-15-000-database-connections-to-under-100-digitaloceans-tale-of-tech-debt"&gt;From 15,000 database connections to under 100.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Any other solutions besides a queue?
&lt;/h2&gt;

&lt;p&gt;Multiple. Each with its own advantages and disadvantages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron&lt;/strong&gt;&lt;br&gt;
You can use cron job schedulers to automate such tasks. The issue with cron is that the job and its execution time have to be written explicitly and before the actual execution, making your architecture highly static and not event-driven. Cron is mainly suitable for a well-defined, known set of tasks that have to take place regardless, rather than ones triggered by a user action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;br&gt;
A database can be a good and simple place to store tasks, and is often used that way in the early days of a product MVP,&lt;br&gt;
but there are multiple issues with that approach, for example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ordering of insertion is not guaranteed, and therefore the tasks handling might not take place in the order they actually got created.&lt;/li&gt;
&lt;li&gt;Double processing can happen: since a database does not delete a record once it is read, a specific task can potentially be read and processed twice, and the results can be catastrophic for a system’s behavior.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Traditional queues
&lt;/h2&gt;

&lt;p&gt;Often, for task scheduling, the chosen queue would probably be a pub/sub system like RabbitMQ.&lt;/p&gt;

&lt;p&gt;Choosing RabbitMQ over a classic broker such as Kafka does make sense in the context of task scheduling: Kafka naturally retains records (or tasks) until a specific point in time, whether or not they have been acknowledged, which makes it less suitable for this type of work.&lt;/p&gt;

&lt;p&gt;The downside of choosing RabbitMQ is its more limited scale, robustness, and performance, which become increasingly necessary over time.&lt;/p&gt;

&lt;p&gt;With that idea in mind, Memphis is a broker that offers scale, robustness, and high throughput, alongside a retention type that fully enables task scheduling over a message broker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memphis Broker is a perfect queue for task scheduling
&lt;/h2&gt;

&lt;p&gt;In v1.2, Memphis released support for ack-based retention through Memphis Cloud. Read more &lt;a href="https://docs.memphis.dev/memphis/memphis-broker/concepts/station#retention"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Messages will be removed from a station only when &lt;strong&gt;acknowledged by all&lt;/strong&gt; the connected consumer groups. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If we have only one connected consumer group when a message/record is acknowledged, it will be automatically removed from the station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If we have two connected consumer groups, the message will be removed from the station (=queue) once all CGs acknowledge the message.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
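&lt;p&gt;The retention rule above can be modeled in a few lines. This is a toy model of the bookkeeping, not the broker’s actual implementation:&lt;/p&gt;

```javascript
// Toy model of ack-based retention: a message stays in the station until
// every connected consumer group has acknowledged it.
class Station {
  constructor(groups) {
    this.groups = groups;     // names of the connected consumer groups
    this.pending = new Map(); // msgId -> Set of groups that still have to ack
  }
  produce(msgId) {
    this.pending.set(msgId, new Set(this.groups));
  }
  ack(msgId, group) {
    const waiting = this.pending.get(msgId);
    if (!waiting) return;
    waiting.delete(group);
    // Only when every group has acked is the message removed from the station.
    if (waiting.size === 0) this.pending.delete(msgId);
  }
  retained(msgId) {
    return this.pending.has(msgId);
  }
}

const station = new Station(["cg_workers", "cg_auditors"]);
station.produce("task-1");
station.ack("task-1", "cg_workers");
console.log(station.retained("task-1")); // true: cg_auditors has not acked yet
station.ack("task-1", "cg_auditors");
console.log(station.retained("task-1")); // false: removed from the station
```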

&lt;p&gt;We mentioned earlier the advantages and disadvantages of using traditional queues such as RabbitMQ in comparison to common brokers such as Kafka in the context of task scheduling. Compared to both tools, Memphis aims to deliver the best of both worlds.&lt;/p&gt;

&lt;p&gt;A few of Memphis.dev’s advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ordering&lt;/li&gt;
&lt;li&gt;Exactly-once delivery guarantee&lt;/li&gt;
&lt;li&gt;Highly scalable, serving data at high throughput with low latency&lt;/li&gt;
&lt;li&gt;Ack-based retention&lt;/li&gt;
&lt;li&gt;Many-to-Many pattern&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Getting started with Memphis Broker as a tasks queue
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://cloud.memphis.dev/"&gt;Sign up&lt;/a&gt; to Memphis Cloud.&lt;/li&gt;
&lt;li&gt;Connect your task producer. Producers are the entities that insert new records or tasks; consumers are the entities that read and process them. A single client with a single connection object can act as both a producer and a consumer at the same time, though not for the same station, since that would lead to an infinite loop. This pattern mainly reduces footprint and the number of required “workers”: a single worker can produce tasks to a specific station while also acting as a consumer or processor for another station of a different use case.
The code example below creates an ack-based station and initiates a producer in Node.js:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "CLIENT_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphis.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const producer = await memphisConnection.producer({
      stationName: "tasks",
      producerName: "producer-1",
    });

    const headers = memphis.headers();
    headers.add("Some_KEY", "Some_VALUE");
    await producer.produce({
      message: {taskID: 123, task: "deploy a new instance"}, // the task payload, sent as a JS object
      headers: headers,
    });

    memphisConnection.close();
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Connect your task consumer. The consumer group below will consume tasks, process them, and, once finished, acknowledge them.
By acknowledging a task, the broker removes its record, ensuring exactly-once processing. We declare the station entity here as well in case the consumer starts before the producer; the declaration only takes effect if the station does not already exist. Another thing to remember is that a consumer group can contain multiple consumers to increase parallelism and read throughput. Within each consumer group, only a single consumer will read and ack a specific message, not all the contained consumers. If that pattern is needed, multiple consumer groups are required.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "APPLICATION_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphis.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const consumer = await memphisConnection.consumer({
      stationName: "tasks",
      consumerName: "worker1",
      consumerGroup: "cg_workers",
    });

    consumer.setContext({ key: "value" });
    consumer.on("message", (message, context) =&amp;gt; {
      console.log(message.getData().toString());
      message.ack();
      const headers = message.getHeaders();
    });

    consumer.on("error", (error) =&amp;gt; {});
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;If you liked this tutorial and want to learn what else you can do with Memphis, head &lt;a href="https://docs.memphis.dev/memphis/getting-started/tutorials"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis-broker"&gt;Github&lt;/a&gt;•&lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt;•[Discord (&lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;https://discord.com/invite/DfWFT7fzUu&lt;/a&gt;)&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/blog/streaming-first-infrastructure-for-real-time-machine-learning/"&gt;Memphis.dev&lt;/a&gt; By &lt;a href="https://www.linkedin.com/in/shay-bratslavsky/"&gt;Shay Bratslavsky&lt;/a&gt;, Software Engineer at @Memphis.dev&lt;/p&gt;

</description>
      <category>messagebroker</category>
      <category>messagequeue</category>
      <category>taskmanagement</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>You have a chance to save the world! 🔥</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 17 Jul 2023 12:11:02 +0000</pubDate>
      <link>https://forem.com/memphis_dev/you-have-a-chance-to-save-the-world-g5e</link>
      <guid>https://forem.com/memphis_dev/you-have-a-chance-to-save-the-world-g5e</guid>
      <description>&lt;p&gt;&lt;a href="https://www.hackathon.memphis.dev/"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UqRjg2VD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3vl99ybzt3gtzvcr1be.png" alt="Savezakar hackathon" width="800" height="240"&gt;&lt;/a&gt;&lt;br&gt;
We are happy to announce Memphis’ first hackathon, #SaveZakar! 📣📣📣&lt;/p&gt;

&lt;p&gt;Sponsored by &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;, &lt;a href="https://streamlit.io/"&gt;Streamlit&lt;/a&gt; and &lt;a href="https://supabase.com/"&gt;Supabase&lt;/a&gt; - Join us to save the world from wildfires using real-time data and AI!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.hackathon.memphis.dev/"&gt;Sign up&lt;/a&gt; 🔥&lt;/p&gt;




&lt;h2&gt;
  
  
  What is the hackathon all about? 🌎
&lt;/h2&gt;

&lt;p&gt;Wildfires wreak havoc every year. They take human and animal lives. The fires destroy homes and other property. They destroy agricultural and industrial crops and cause famines. They damage the environment, contribute to global warming, and generate smoke that pollutes the air. Their overall impact runs into billions of dollars and includes incalculable harm to people and animals.&lt;/p&gt;

&lt;p&gt;Where and when wildfires will occur is difficult to predict ahead of time. Instead, researchers; federal, state, and municipal governments; and international non-profits have invested heavily in early warning systems. If fires can be detected early, firefighters can be deployed to prevent their spread. If people can be notified, they can be evacuated to avoid loss of life.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your mission 🔥
&lt;/h2&gt;

&lt;p&gt;In this hackathon, you are going to create a wildfire early warning system for the fictional island nation of Zakar. Zakar has been struggling with wildfires in the last few years. For a small island nation, fires are particularly problematic. Importing materials is expensive and takes time, so the nation must do everything it can to protect its homes, farms, and natural resources. Similarly, with a relatively small geographic footprint, smoke can quickly pollute the air, causing health problems for the people.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prizes 🏆
&lt;/h2&gt;

&lt;p&gt;Each project will be judged on the following criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creativity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most informative visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most accurate solution (For the early warning system)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most interesting architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most interesting algorithm&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides internal glory 😉, the best project will get the perfect gaming package, which includes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SteelSeries Arctis Nova Pro Wireless Multi-System Gaming Headset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logitech G Pro Wireless Gaming Mouse - League of Legends Edition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RARE Framed Nintendo Game Boy Color GBC Disassembled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tons of swag from Memphis.dev and Streamlit!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second-best project will receive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Logitech G Pro Wireless Gaming Mouse - League of Legends Edition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tons of swag from Memphis.dev and Streamlit!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Useful information💡
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The hackathon week will run from July 31 to August 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two types of projects may be submitted: an early warning system or a data visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The submission deadline is Monday, August 7, 2023.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://discord.gg/q37A5ZF4yH"&gt;Discord&lt;/a&gt; channel to get full support.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The winners will be announced on August 21, 2023.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://www.hackathon.memphis.dev/"&gt;Sign up&lt;/a&gt; 🔥
&lt;/h2&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datahackathon</category>
      <category>data</category>
      <category>streamlit</category>
      <category>supabase</category>
    </item>
    <item>
      <title>Introducing Memphis.dev Cloud: Empowering Developers with the Next Generation of Streaming</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Mon, 03 Jul 2023 13:30:37 +0000</pubDate>
      <link>https://forem.com/memphis_dev/introducing-memphisdev-cloud-empowering-developers-with-the-next-generation-of-streaming-50kp</link>
      <guid>https://forem.com/memphis_dev/introducing-memphisdev-cloud-empowering-developers-with-the-next-generation-of-streaming-50kp</guid>
      <description>&lt;p&gt;Event processing innovator Memphis.dev today introduced Memphis Cloud to enable a full serverless experience for massive scale event streaming and processing, and announced it had secured $5.58 million in seed funding co-led by Angular Ventures and boldstart ventures, with participation from JFrog co-founder and CTO Fred Simon, Snyk co-founder Guy Podjarny, CircleCI CEO Jim Rose, Console.dev co-founder David Mytton, and Priceline CTO Martin Brodbeck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Memphis.dev Cloud
&lt;/h2&gt;

&lt;p&gt;Memphis.dev, the next-generation event streaming platform, is ready to make waves and disrupt data streaming with its highly anticipated cloud service launch.&lt;br&gt;
With a firm commitment to providing developers and data engineers with a powerful and unified streaming engine, Memphis.dev aims to revolutionize the way software utilizes message brokers. In this blog post, we delve into the key features and benefits of Memphis.dev’s cloud service, highlighting how it empowers organizations and developers to unleash the full potential of their data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to expect?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The Serverless Experience&lt;/strong&gt;&lt;br&gt;
Memphis’ platform was intentionally designed to be deployed in minutes on any Kubernetes cluster and on any cloud, whether on-prem, in the public cloud, or even in air-gapped environments.&lt;br&gt;
In anticipation of the rise of multi-cloud architecture, Memphis streamlines development from the local dev station all the way to production, both on-prem and across various clouds, while reducing TCO and overall complexity – and the serverless cloud enables even faster onboarding and time-to-value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Enable “Day 2” operations&lt;/strong&gt;&lt;br&gt;
Message brokers need to evolve to handle the vast amount and complexity of events that occur, and they must incorporate three critical elements: reliability; ease of management and scaling; and what we call “Day 2” operations on top, which help build queue-based, stream-driven applications in minutes.&lt;/p&gt;

&lt;p&gt;To support both the small and the massive scales and workloads Memphis is built for, some key features could only be delivered via the cloud.&lt;/p&gt;

&lt;p&gt;Key features in Memphis Cloud include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Augmenting Kafka clusters – providing the missing piece in modern stream processing with the ability to augment Kafka clusters;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schemaverse – Enabling built-in schema management, enforcement, and transformation to ensure data quality as our data gets complicated and branched;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-tenancy – Offering the perfect solution for users of SaaS platforms who want to isolate traffic between their customers;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;True multi-cloud – creating primary instances on GCP, and a replica on AWS.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;3. A Developer-Centric Approach (and Obsession)&lt;/strong&gt;&lt;br&gt;
Memphis.dev’s cloud service launch is driven by a developer-centric philosophy, recognizing that developers are the driving force behind technological innovation. With a deep understanding of developers’ and data engineers’ needs, especially in the current era of much more complicated pipelines with much fewer hands, Memphis.dev has created a comprehensive suite of out-of-the-box tools and features tailored specifically to enhance productivity, streamline workflows, and facilitate collaboration. By prioritizing the developer experience, Memphis.dev aims to empower developers to focus on what they do best: writing exceptional code, and extracting value out of their data!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. No code changes. Open-source to Cloud.&lt;/strong&gt;&lt;br&gt;
The development experience is fully aligned between the open-source version and the cloud. No code changes are needed, nor any application config modification.&lt;br&gt;
The cloud introduces just one additional, optional parameter: an account ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Enhanced Security and Compliance&lt;/strong&gt;:&lt;br&gt;
Memphis.dev prioritizes the security and compliance of its cloud service, recognizing the critical importance of protecting sensitive data. With robust security measures, including data encryption, role-based identity and access management, integration with 3rd party identity managers, and regular security audits, Memphis.dev ensures that developers’ applications and data are safeguarded. By adhering to industry-standard compliance frameworks, Memphis.dev helps developers meet regulatory requirements and build applications with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Support and Success&lt;/strong&gt;&lt;br&gt;
The core support and customer service ethos of Memphis.dev is customer success and enablement. A successful customer is a happy customer, and we work hard to support our customers not just with Memphis, but with their bigger picture and data journey. Three global customer support teams are spread across three different time zones, alongside highly experienced data engineers and data architects who serve as Customer Success Engineers and are willing to dive into our customers' internals to help them achieve their goals.&lt;/p&gt;




&lt;p&gt;“Cluster setup, fault tolerance, high availability, data replication, performance tuning, multitenancy, security, monitoring, and troubleshooting all are headaches everyone who has deployed traditional message broker platforms is familiar with,” said Torsten Volk, senior analyst, EMA. “Memphis however is incredibly simple so that I had my first Python app sending and receiving messages in less than 15 minutes.”    &lt;/p&gt;

&lt;p&gt;“The world is asynchronous and built out of events. Message brokers are the engine behind their flow in the modern software architecture, and when we looked at the bigger picture and the role message brokers play, we immediately understood that the modern message broker should be much more intelligent and by far with much less friction,” said Yaniv Ben Hemo, co-founder and CEO, Memphis. “With that idea, we built Memphis.dev which takes five minutes on average for a user to get to production and start building queue-based applications and distributed streaming pipelines.”&lt;/p&gt;




&lt;p&gt;&lt;a href="https://share.hsforms.com/1lBALaPyfSRS3FLLLy_Hsfgcqtej"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev by Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 4: Validating CDC Messages with Schemaverse</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 22 Jun 2023 10:58:53 +0000</pubDate>
      <link>https://forem.com/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-2cfk</link>
      <guid>https://forem.com/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-2cfk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part four of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous two blog posts (&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;), we described how to implement a change data capture (CDC) pipeline for &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; using &lt;a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schema on Write, Schema on Read
&lt;/h2&gt;

&lt;p&gt;With relational databases, schemas are defined before any data are ingested.  Only data that conforms to the schema can be inserted into the database.  This is known as “schema on write.”  This pattern ensures data integrity but can limit flexibility and the ability to evolve a system.  &lt;/p&gt;

&lt;p&gt;Predefined schemas are optional in NoSQL databases like MongoDB.  MongoDB models collections of objects.  In the most extreme case, collections can contain completely different types of objects such as cats, tanks, and books.  More commonly, fields may only be present on a subset of objects or the value types may vary from one object to another.  This flexibility makes it easier to evolve schemas over time and efficiently support objects with many optional fields.&lt;/p&gt;

&lt;p&gt;Schema flexibility puts more onus on applications that read the data.  Clients need to check for any desired field and confirm their data types.  This pattern is called "schema on read."&lt;/p&gt;
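&lt;p&gt;To make "schema on read" concrete, here is a minimal Python sketch of a consumer that checks for optional fields and validates value types as it reads each document.  The field names mirror the todo example used later in this series; the default value is illustrative, not part of any MongoDB or Memphis API:&lt;/p&gt;

```python
import json

def read_todo(raw):
    """Schema on read: the consumer validates and defaults fields at read time."""
    todo = json.loads(raw)
    # The field may be absent, so supply a default rather than assume it exists.
    description = todo.get("description", "(no description)")
    due_date = todo.get("due_date")  # optional field: may be missing entirely
    # The value type may vary between objects, so confirm it before use.
    if due_date is not None and not isinstance(due_date, int):
        raise TypeError("unexpected due_date type: %s" % type(due_date).__name__)
    return description, due_date

# Two documents from the same collection with different shapes.
print(read_todo('{"description": "buy milk", "due_date": 1684266602978}'))
print(read_todo('{"description": "walk dog", "completed": true}'))
```

&lt;p&gt;Every consumer repeating checks like these by hand is exactly the burden that broker-side schema validation can remove.&lt;/p&gt;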




&lt;h2&gt;
  
  
  Malformed Records Cause Crashes
&lt;/h2&gt;

&lt;p&gt;In one of my positions earlier in my career, I worked on a team that developed and maintained data pipelines for an online ad recommendation system.  One of the most common sources of downtime were malformed records.  Pipeline code can fail if a field is missing, an unexpected value is encountered, or when trying to parse badly-formatted data.  If the pipeline isn't developed with errors in mind (e.g., using &lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;defensive programming techniques&lt;/a&gt;, explicitly-defined data models, and validating data), the entire pipeline may crash and require manual intervention by an operator.&lt;/p&gt;

&lt;p&gt;Unfortunately, malformed data, especially when handling large volumes of data, is a frequent occurrence.  Simply hoping for the best won't lead to resilient pipelines.  As the saying goes, "Hope for the best. Plan for the worst."&lt;/p&gt;
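&lt;p&gt;Planning for the worst can be as simple as isolating each record's processing and diverting failures instead of letting one bad record crash the pipeline.  Here is a hypothetical Python sketch of that defensive pattern; the record shape and the in-memory dead-letter list are illustrative:&lt;/p&gt;

```python
import json

def process_batch(raw_messages):
    """Process records defensively: malformed records are diverted to a
    dead-letter list instead of crashing the whole pipeline."""
    processed, dead_letter = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)  # may raise on badly-formatted data
            # Explicit model check: fail fast on a missing or mistyped field.
            if not isinstance(record.get("description"), str):
                raise ValueError("missing or non-string 'description'")
            processed.append(record["description"].strip())
        except (ValueError, TypeError) as exc:
            dead_letter.append((raw, str(exc)))  # keep the record for inspection
    return processed, dead_letter

good, bad = process_batch(['{"description": " buy milk "}', '{oops', '{"description": 7}'])
```

&lt;p&gt;Here one well-formed record is processed while the two malformed ones are set aside with their error messages, ready for an operator to inspect later.&lt;/p&gt;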




&lt;h2&gt;
  
  
  The Best of Both Worlds: Data Validation with Schemaverse
&lt;/h2&gt;

&lt;p&gt;Fortunately, Memphis.dev has an awesome feature called Schemaverse.  Schemaverse provides a mechanism to check messages for compliance with a specified schema and handle non-conforming messages.&lt;/p&gt;

&lt;p&gt;To use Schemaverse, the operator first needs to define a schema.  Message schemas can be defined using JSON Schema, Google Protocol Buffers, or GraphQL.  The operator will choose the schema definition language appropriate to the format of the message payloads.&lt;/p&gt;

&lt;p&gt;Once a schema is defined, the operator can "attach" the schema to a station.  The schema will be downloaded by clients using the Memphis.dev client SDKs.  The client SDK will validate each message before sending it to the Memphis broker.  If a message fails validation, the client will redirect it to the dead-letter queue, trigger a notification, and raise an exception to notify the client application.&lt;/p&gt;

&lt;p&gt;In this example, we'll look at using Schemaverse to validate change data capture (CDC) events from MongoDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Review of the Solution
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;previous post&lt;/a&gt;, we described a change data capture (CDC) pipeline for a collection of todo items stored in MongoDB.  Our solution consists of eight components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt; to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i4llCJng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5xr8j1rklgc20bi14mb0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i4llCJng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5xr8j1rklgc20bi14mb0.jpg" alt="dataflow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;br&gt;
In this iteration, we aren't adding or removing any of the components.  Rather, we're just going to change Memphis.dev's configuration to perform schema validation on messages sent to the "cleaned-todo-cdc-events" station.&lt;/p&gt;


&lt;h2&gt;
  
  
  Schema for Todo Change Data Capture (CDC) Events
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;, we transformed the messages to hydrate a serialized JSON subdocument to produce fully deserialized JSON messages.  The resulting message looked like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each JSON-encoded message has two top-level fields, "schema" and "payload."  We are concerned with the "payload" field, which has two required fields, "before" and "after."  The before field contains a copy of the record before it was modified (or null if it didn't exist), while the after field contains a copy of the record after it was modified (or null if the record was deleted).&lt;/p&gt;

&lt;p&gt;From this example, we can define criteria that messages must satisfy to be considered valid.  Let's write the criteria out as a set of rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The payload/before field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;The payload/after field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;A todo object must have five fields ("_id", "creation_timestamp", "due_date", "description", and "completed").&lt;/li&gt;
&lt;li&gt;The creation_timestamp must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The due_date must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The description field should have a string value.  Nulls are not allowed.&lt;/li&gt;
&lt;li&gt;The completed field should have a boolean value.  Nulls are not allowed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this project, we'll define the schema using &lt;a href="https://json-schema.org/"&gt;JSON Schema&lt;/a&gt;. JSON Schema is a very powerful data modeling language.  It supports defining required fields, field types (e.g., integers, strings, etc.), whether fields are nullable, field formats (e.g., dates / times, email addresses), and field constraints (e.g., minimum or maximum values).  Objects can be defined and referenced by name, allowing recursive schemas and reusable definitions.  Schemas can be further combined using and, or, any, and not operators.  As one might expect, this expressiveness comes at a cost: the JSON Schema definition language is complex, and unfortunately, covering it is beyond the scope of this tutorial.&lt;/p&gt;
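&lt;p&gt;Before reaching for JSON Schema, it can help to see that the seven rules above are small enough to check by hand.  The following Python sketch encodes them directly; it is a hand-rolled illustration of the criteria, not the validation engine Schemaverse actually uses:&lt;/p&gt;

```python
def is_valid_todo(obj):
    """Rules 3-7: a todo object has five fields with the expected types."""
    if not isinstance(obj, dict):
        return False
    required = {"_id", "creation_timestamp", "due_date", "description", "completed"}
    if not required.issubset(obj.keys()):
        return False

    def is_date(d):
        # A date is an object with a single "$date" field holding a
        # positive integer (Unix timestamp); booleans are excluded.
        return (isinstance(d, dict) and set(d) == {"$date"}
                and isinstance(d["$date"], int)
                and not isinstance(d["$date"], bool)
                and d["$date"] > 0)

    return (is_date(obj["creation_timestamp"]) and is_date(obj["due_date"])
            and isinstance(obj["description"], str)
            and isinstance(obj["completed"], bool))

def is_valid_message(msg):
    """Rules 1-2: payload.before and payload.after are each null or a todo."""
    payload = msg.get("payload")
    if not isinstance(payload, dict):
        return False
    return all(payload.get(key) is None or is_valid_todo(payload[key])
               for key in ("before", "after"))
```

&lt;p&gt;The JSON Schema document below expresses the same rules declaratively, which is what lets Schemaverse enforce them without custom validation code in every client.&lt;/p&gt;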




&lt;h2&gt;
  
  
  Creating a Schema and Attaching it to a Station
&lt;/h2&gt;

&lt;p&gt;Let's walk through the process of creating a schema and attaching it to a station.  You'll first need to complete the first 10 steps from &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 11: Navigate to the Schemaverse Tab&lt;/strong&gt;&lt;br&gt;
Navigate to the Memphis UI in your browser.  For a local deployment, it is typically found at &lt;a href="https://localhost:9000/"&gt;https://localhost:9000/&lt;/a&gt;.  Once you are signed in, navigate to the Schemaverse tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2nruqPkO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3mrrv389yavvu6salpbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2nruqPkO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3mrrv389yavvu6salpbv.png" alt="Image description" width="512" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 12: Create the Schema&lt;/strong&gt;&lt;br&gt;
Click the "Create from blank" button to create a new schema.  Set the schema name to "todo-cdc-schema" and the schema type to "JSON schema."  Paste the following JSON Schema document into the textbox on the right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/product.schema.json",
    "type" : "object",
    "properties" : {
        "payload" : {
            "type" : "object",
            "properties" : {
                "before" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                },
                "after" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                }
            },
            "required" : ["before", "after"]
        }
    },
    "required" : ["payload"],
   "$defs" : {
      "todoItem" : {
          "title": "TodoItem",
          "description": "An item in a todo checklist",
          "type" : "object",
          "properties" : {
              "_id" : {
                  "type" : "object",
                  "properties" : {
                      "$oid" : {
                          "type" : "string"
                      }
                  }
              },
              "description" : {
                  "type" : "string"
              },
              "creation_timestamp" : {
                  "type" : "object",
                  "properties" : {
                      "$date" : {
                          "type" : "integer"
                      }
                  }
              },
              "due_date" : {
                    "anyOf" : [
                        {
                            "type" : "object",
                            "properties" : {
                                "$date" : {
                                    "type" : "integer"
                                }
                            }
                        },
                        {
                            "type" : "null"
                        }
                    ]
              },
              "completed" : {
                  "type" : "boolean"
              }
          },
          "required" : ["_id", "description", "creation_timestamp", "completed"]
      }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When done, your window should look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VU3PPqk2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b81bwyzcme5wviv2rkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VU3PPqk2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0b81bwyzcme5wviv2rkb.png" alt="schemaverse" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When done, click the "Create schema" button. Once the schema has been created, you'll be returned to the Schemaverse tab.  You should see an entry for the newly created schema like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4ATDAiU3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cdem09g467p976jr19r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4ATDAiU3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cdem09g467p976jr19r.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 13: Attach the Schema to the Station&lt;/strong&gt;&lt;br&gt;
Once the schema is created, we want to attach the schema to the "cleaned-todo-cdc-events" station. Double-click on the "todo-cdc-schema" window to bring up its details window like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7y_Pg5Ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhihxzs4hbolg874szvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7y_Pg5Ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhihxzs4hbolg874szvn.png" alt="todo cdc schema" width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click on the "+ Attach to Station" button.  This will bring up the following window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--awNMOoLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fitl5kqwgrql1jv3tehl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--awNMOoLn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fitl5kqwgrql1jv3tehl.png" alt="enforce schema" width="800" height="1268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the "cleaned-todo-cdc-events" station, and click "Attach Selected."  The producers attached to the station will automatically download the schema and begin validating outgoing messages within a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 14: Confirm that Messages are Being Filtered&lt;/strong&gt;&lt;br&gt;
Navigate to the station overview page for the "cleaned-todo-cdc-events" station.  After a couple of minutes, you should see a red warning notification icon next to the "Dead-letter" tab name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KNbaOmRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yttzmt80x63flcxdu0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KNbaOmRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3yttzmt80x63flcxdu0y.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you click on the "Dead-letter" tab and then the "Schema violation" subtab, you'll see the messages that failed the schema validation.  These messages have been re-routed to the dead letter queue so that they don't cause bugs in the downstream pipelines.  The window will look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo98aEtk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cs57i1x337o7u4h53owv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo98aEtk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cs57i1x337o7u4h53owv.png" alt="Image description" width="800" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations!  You're now using Schemaverse to validate messages.  This is one small but incredibly impactful step towards making your pipeline more reliable.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1, 2, and 3:&lt;br&gt;
&lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Part 3: Transforming MongoDB CDC Event Messages&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, Developer Advocate at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>schemaverse</category>
      <category>data</category>
    </item>
    <item>
      <title>Part 3: Transforming MongoDB CDC Event Messages</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Tue, 06 Jun 2023 10:26:31 +0000</pubDate>
      <link>https://forem.com/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-a8p</link>
      <guid>https://forem.com/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-a8p</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part three of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;last blog post&lt;/a&gt;, we introduced a reference implementation for capturing change data capture (CDC) events from a &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; database using &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.  At the end of the post we noted that MongoDB records are serialized as strings in Debezium CDC messages like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "schema" : ...,

"payload" : {
"before" : null,

"after" : "{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ed\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402978},\\"due_date\\": {\\"$date\\": 1684266602978},\\"description\\": \\"buy milk\\",\\"completed\\": false}",

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to use the &lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management"&gt;Schemaverse&lt;/a&gt; functionality of Memphis.dev to check messages against an expected schema.  Messages that don’t match the schema are routed to a dead letter station so that they don’t impact downstream consumers.  If this all sounds like ancient Greek, don’t worry!  We’ll explain the details in our next blog post.&lt;/p&gt;

&lt;p&gt;To use functionality like Schemaverse, we need to deserialize the MongoDB records as JSON documents.  In this blog post, we describe a modification to our MongoDB CDC pipeline that adds a transformer service to deserialize the MongoDB records to JSON documents.&lt;/p&gt;
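&lt;p&gt;The core of that transformation is small: parse the stringified "before" and "after" fields back into JSON objects.  Here is a minimal Python sketch of the idea; the function name is ours, and the event below is a trimmed version of the example message above:&lt;/p&gt;

```python
import json

def hydrate_cdc_event(message):
    """Deserialize the stringified 'before'/'after' fields of a Debezium
    MongoDB CDC payload into proper JSON objects."""
    payload = message["payload"]
    for key in ("before", "after"):
        value = payload.get(key)
        if isinstance(value, str):  # the MongoDB record arrives as a string
            payload[key] = json.loads(value)
    return message

event = {
    "payload": {
        "before": None,
        "after": '{"_id": {"$oid": "645fe9eaf4790c34c8fcc2ed"}, "description": "buy milk", "completed": false}',
    }
}
hydrated = hydrate_cdc_event(event)
```

&lt;p&gt;After hydration, downstream consumers (and schema validators) can address fields like payload.after.description directly instead of re-parsing strings.&lt;/p&gt;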




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous solution&lt;/a&gt; consisted of six components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AghDMpS0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gzsz7y5tf54tqpckoqiq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AghDMpS0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gzsz7y5tf54tqpckoqiq.jpg" alt="mongocdcd example" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this iteration, we are adding two additional components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our updated architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xLcRiKeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvrythdmblgiti7ob40m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xLcRiKeb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvrythdmblgiti7ob40m.jpg" alt="data flow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Deep Dive Into the Transformer Service
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skeleton of the Message Transformer Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt;.  Let’s walk through the transformer implementation.  The main() method of our transformer first connects to the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt;. The connection details are grabbed from environmental variables.  The host, username, password, input station name, and output station name are passed using environmental variables in accordance with suggestions from the &lt;a href="https://12factor.net/config"&gt;Twelve-Factor App manifesto&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def main():
    try:
        print("Waiting on messages...")
        memphis = Memphis()
        await memphis.connect(host=os.environ[HOST_KEY],
                              username=os.environ[USERNAME_KEY],
                              password=os.environ[PASSWORD_KEY])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once a connection is established, we create consumer and producer objects.  In Memphis.dev, consumers and producers have names.  These names appear in the Memphis.dev UI, offering transparency into the system operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Creating consumer")
        consumer = await memphis.consumer(station_name=os.environ[INPUT_STATION_KEY],
                                          consumer_name="transformer",
                                          consumer_group="")

        print("Creating producer")
        producer = await memphis.producer(station_name=os.environ[OUTPUT_STATION_KEY],
                                          producer_name="transformer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer API uses the &lt;a href="https://en.wikipedia.org/wiki/Callback_(computer_programming)"&gt;callback function&lt;/a&gt; design pattern. When messages are pulled from the broker, the provided function is called with a list of messages as its argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  print("Creating handler")
        msg_handler = create_handler(producer)

        print("Setting handler")
        consumer.consume(msg_handler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setting up the callback, we kick off the asyncio event loop.  At this point, the transformer service pauses and waits until messages are available to pull from the broker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Keep your main thread alive so the consumer will keep receiving data
        await asyncio.Event().wait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Creating the Message Handler Function
&lt;/h2&gt;

&lt;p&gt;The create function for the message handler takes a producer object and returns a callback function. Since the callback function only takes a single argument, we use the &lt;a href="https://en.wikipedia.org/wiki/Closure_(computer_programming)"&gt;closure pattern&lt;/a&gt; to implicitly pass the producer to the msg_handler function when we create it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;msg_handler&lt;/code&gt; function is passed three arguments when called: a list of messages, an error (if one occurred), and a context consisting of a dictionary.  Our handler loops over the messages, calls the transform function on each, sends the messages to the second station using the producer, and acknowledges that the message has been processed.  In Memphis.dev, messages are not marked off as delivered until the consumer acknowledges them.  This prevents messages from being dropped if an error occurs during processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_handler(producer):
    async def msg_handler(msgs, error, context):
        try:
            for msg in msgs:
                transformed_msg = deserialize_mongodb_cdc_event(msg.get_data())
                await producer.produce(message=transformed_msg)
                await msg.ack()
        except (MemphisError, MemphisConnectError, MemphisHeaderError) as e:
            print(e)
            return

    return msg_handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Message Transformer Function
&lt;/h2&gt;

&lt;p&gt;Now, we get to the meat of the service: the message transformer function.  Message payloads (returned by the get_data() method) are stored as &lt;a href="https://docs.python.org/3/library/stdtypes.html#bytearray"&gt;bytearray&lt;/a&gt; objects.  We use the Python json library to deserialize the messages into a hierarchy of Python collections (list and dict) and primitive types (int, float, str, and None).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deserialize_mongodb_cdc_event(input_msg):
    obj = json.loads(input_msg)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We expect the object to have a payload property with an object as the value.  That object then has two properties (“before” and “after”) which are either None or strings containing serialized JSON objects.  We use the JSON library again to deserialize and replace the strings with the objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if "payload" in obj:
        payload = obj["payload"]

        if "before" in payload:
            before_payload = payload["before"]
            if before_payload is not None:
                payload["before"] = json.loads(before_payload)

        if "after" in payload:
            after_payload = payload["after"]
            if after_payload is not None:
                payload["after"] = json.loads(after_payload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we reserialize the entire JSON record and convert it back into a bytearray for transmission to the broker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  output_s = json.dumps(obj)
    output_msg = bytearray(output_s, "utf-8")
    return output_msg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hooray! Our objects now look like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
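&lt;p&gt;To sanity-check the transformation end to end, here is a minimal standalone sketch of the deserialization logic described above. It is a condensed, illustrative version of the article's function (only the standard-library json module is used), not the repository code itself; the sample payload mirrors the Debezium message shown earlier.&lt;/p&gt;

```python
import json

def deserialize_mongodb_cdc_event(input_msg):
    # Parse the outer Debezium envelope.
    obj = json.loads(input_msg)
    payload = obj.get("payload", {})
    # "before" and "after" arrive as JSON-encoded strings; decode them in place.
    for key in ("before", "after"):
        if payload.get(key) is not None:
            payload[key] = json.loads(payload[key])
    # Reserialize for transmission to the broker.
    return bytearray(json.dumps(obj), "utf-8")

# Sample message mirroring the Debezium output shown earlier.
record = {"_id": {"$oid": "645fe9eaf4790c34c8fcc2ed"},
          "description": "buy milk", "completed": False}
sample = json.dumps({"payload": {"before": None, "after": json.dumps(record)}})

out = json.loads(deserialize_mongodb_cdc_event(sample))
print(out["payload"]["after"]["description"])  # buy milk
```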






&lt;h2&gt;
  
  
  Running the Transformer Service
&lt;/h2&gt;

&lt;p&gt;If you followed the 7 steps in the &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous blog post&lt;/a&gt;, you only need to run three additional steps to start the transformer service and verify that it’s working:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Start the Transformer Service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cdc-transformer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cdc-transformer                                  Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 9: Start the Second Printing Consumer
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cleaned-printing-consumer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cleaned-printing-consumer                        Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 10: Check the Memphis UI
&lt;/h2&gt;

&lt;p&gt;When the transformer starts producing messages to Memphis.dev, a second station named "cleaned-todo-cdc-events" will be created.  You should see this new station on the Station Overview page in the Memphis.dev UI like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jrjoaM1S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4037i8dcy49e2y3nsljo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jrjoaM1S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4037i8dcy49e2y3nsljo.png" alt="Check memphis ui" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The details page for the "cleaned-todo-cdc-events" station should show the transformer attached as a producer, the printing consumer, and the transformed messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2zZSSU0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3xdims35q2t5gdeojo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2zZSSU0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3xdims35q2t5gdeojo4.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations!  We’re now ready to tackle validating messages using Schemaverse in our next blog post. Subscribe to our newsletter to stay tuned! &lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://memphis.dev/blog/part-4-validating-cdc-messages-with-schemaverse/"&gt;Part 4: Validating CDC Messages with Schemaverse&lt;/a&gt; to learn more.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1 &amp;amp; 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, Developer Advocate at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>cdc</category>
      <category>memphisdev</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Apache Kafka vs Memphis</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Tue, 30 May 2023 15:12:59 +0000</pubDate>
      <link>https://forem.com/memphis_dev/apache-kafka-vs-memphis-48de</link>
      <guid>https://forem.com/memphis_dev/apache-kafka-vs-memphis-48de</guid>
      <description>&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is an open-source distributed event streaming platform. Based on the abstraction of a distributed commit log, Kafka can handle a large number of events and provides pub/sub functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Memphis.dev?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memphis&lt;/strong&gt; is a next-generation message broker.&lt;br&gt;
A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables fast and reliable development of next-generation event-driven use cases.&lt;/p&gt;

&lt;p&gt;Memphis.dev enables building next-generation applications that require large volumes of streamed and enriched data, modern protocols, zero ops, rapid development, extreme cost reduction, and a significantly lower amount of dev time for data-oriented developers and data engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  General
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aUpazc7b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pvphuv9fg414n00goc89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aUpazc7b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pvphuv9fg414n00goc89.png" alt="Image description" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Both technologies are available under fully open-source licenses. Memphis also has a commercial distribution with added security, tiered storage, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;p&gt;Kafka uses Apache ZooKeeper™ for consensus and metadata storage.&lt;br&gt;
Memphis uses PostgreSQL for GUI state management only; this dependency will be removed soon, leaving Memphis with no external dependencies. Memphis achieves consensus using Raft.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message Consumption Model
&lt;/h2&gt;

&lt;p&gt;Both Kafka and Memphis use a pull-based architecture where consumers pull messages from the server, and &lt;a href="https://en.wikipedia.org/wiki/Push_technology#Long_polling"&gt;long-polling&lt;/a&gt; is used to ensure new messages are made available instantaneously.&lt;/p&gt;

&lt;p&gt;Pull-based architectures are often preferable for high throughput workloads as they allow consumers to manage their flow control, fetching only what they need.&lt;/p&gt;
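&lt;p&gt;The flow-control benefit of a pull model can be illustrated with a toy in-memory broker (purely illustrative; this is not the Kafka or Memphis client API): the consumer fetches at most a batch-size worth of messages per poll, so it never takes on more than it can process.&lt;/p&gt;

```python
from collections import deque

class ToyBroker:
    """Illustrative in-memory log; not a real Kafka/Memphis client."""
    def __init__(self):
        self.log = deque()

    def produce(self, msg):
        self.log.append(msg)

    def pull(self, batch_size):
        # Consumer-driven flow control: hand back at most batch_size messages.
        batch = []
        while self.log and len(batch) != batch_size:
            batch.append(self.log.popleft())
        return batch

broker = ToyBroker()
for i in range(5):
    broker.produce(f"event-{i}")

print(broker.pull(2))  # ['event-0', 'event-1']
print(broker.pull(2))  # ['event-2', 'event-3']
```

A real client adds long-polling on top of this: when the log is empty, the fetch request parks on the server until new messages arrive, so consumers still see data instantaneously without busy-waiting.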

&lt;h2&gt;
  
  
  Storage Architecture
&lt;/h2&gt;

&lt;p&gt;Kafka uses a distributed commit log as its storage layer. Writes are appended to the end of the log. Reads are sequential, starting from an offset, and data is zero-copied from the disk buffer to the network buffer. This works well for event streaming use cases.&lt;/p&gt;

&lt;p&gt;Memphis also uses a distributed commit log, called streams (built on NATS JetStream), as its storage layer, which can be written entirely to the broker's (server's) memory or disk.&lt;br&gt;
Memphis also uses offsets but abstracts them completely, so the heavy lifting of tracking consumed offsets resides on Memphis rather than on the client.&lt;br&gt;
Memphis also offers storage tiering for offloading messages to S3-compatible storage, enabling infinite retention at lower cost. Reads are sequential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ecosystem and User Experience
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N6UN3-f6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1c2nwgdm4kgyi3i3tvmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N6UN3-f6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1c2nwgdm4kgyi3i3tvmv.png" alt="Image description" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Kafka is a cluster-based technology with a medium-weight architecture requiring two distributed components: Kafka's own servers (brokers) plus ZooKeeper™ servers. ZooKeeper adds an additional level of complexity, though the community is in the process of removing the ZooKeeper component from Kafka (KRaft mode). A "vanilla" Kafka deployment requires a manual binary installation and text-based configuration, as well as configuring OS daemons and internal parameters.&lt;/p&gt;

&lt;p&gt;Memphis has a lightweight yet robust cloud-native architecture and has been packaged as a container from day one. It can be deployed using any Docker engine or Docker Swarm, and in production environments using Helm for Kubernetes (soon with an operator). Memphis' initial configuration is already sufficient for production, and optimizations can take place on the fly without downtime. This approach keeps Memphis completely isolated from the infrastructure it is deployed on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise support and managed cloud offering
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade support and managed cloud offerings for Kafka are available from several prominent vendors, including Confluent, AWS (MSK), Cloudera, and more.&lt;/p&gt;

&lt;p&gt;Memphis provides enterprise support and managed cloud offering that includes features like enhanced security, stream research abilities, an ML-based resource scheduler for better cost savings, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-healing
&lt;/h2&gt;

&lt;p&gt;Kafka is a robust distributed system and requires constant tune-ups, client-made wrappers, management, and tight monitoring. The user or operator is responsible for ensuring it's alive and works as required. This approach has pros and cons, as the user can tune almost every parameter, which is often revealed as a significant burden.&lt;/p&gt;

&lt;p&gt;One of Memphis' core features is to remove frictions of management and autonomously make sure it's alive and performing well using periodic self-checks and proactive rebalancing tasks, as well as fencing the users from misusing the system. In parallel, every aspect of the system can be configured on-the-fly without downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notifications
&lt;/h2&gt;

&lt;p&gt;Memphis has a built-in notification center that can push real-time alerts based on defined triggers like client disconnections, resource depletion, schema violation, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N5E9yv1w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7apahd8ailoqnskq46ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N5E9yv1w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7apahd8ailoqnskq46ed.png" alt="Image description" width="800" height="615"&gt;&lt;/a&gt;&lt;br&gt;
Apache Kafka does not offer an embedded solution for notifications; this can be achieved via commercial offerings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message tracing (aka Stream lineage)
&lt;/h2&gt;

&lt;p&gt;Tracking stream lineage is the ability to understand the full path of a message from the very first producer through the final consumer, including the trail and evolvement of a message between topics. This ability is extremely handy in a troubleshooting process.&lt;/p&gt;

&lt;p&gt;Apache Kafka does not provide a native ability for stream lineage, but it can be achieved using the OpenTelemetry or OpenLineage frameworks, by integrating third-party applications such as Datadog or Epsagon, or by using Confluent's cloud offering.&lt;/p&gt;

&lt;p&gt;Memphis provides stream lineage per message with out-of-the-box visualization for each stamped message using a generated header by the Memphis SDK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOjVTwPk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/44q3pdu8pfmxjzd3zus0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZOjVTwPk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/44q3pdu8pfmxjzd3zus0.png" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Availability and Messaging
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M0yJCcQ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8se7lr60tkrghp1040x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M0yJCcQ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8se7lr60tkrghp1040x3.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Mirroring (Replication)
&lt;/h2&gt;

&lt;p&gt;Kafka Replication means having multiple copies of the data spread across multiple servers/brokers. This helps maintain high availability if one of the brokers goes down and is unavailable to serve the requests.&lt;/p&gt;

&lt;p&gt;Memphis station replication works similarly. During station (=topic) creation, the user can choose the number of replicas derived from the number of available brokers. Messages will be replicated in a RAID-1 manner across the chosen number of brokers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-tenancy
&lt;/h2&gt;

&lt;p&gt;Multi-tenancy refers to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment. The instances (tenants) are logically isolated and often physically integrated. The most famous users are SaaS-type applications.&lt;/p&gt;

&lt;p&gt;Apache Kafka does not natively support multi-tenancy. It can be achieved via complex client logic, different topics, and ACL.&lt;/p&gt;

&lt;p&gt;As Memphis pushes to enable the next generation of applications and especially SaaS-type architectures, Memphis supports multi-tenancy across all the layers from stations (=topics) to security, consumers, and producers, all the way to node selection for complete hardware isolation in case of need. It is enabled using namespaces and can be managed in a unified console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RriDFITY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ruork9xlk9obssmq3mbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RriDFITY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ruork9xlk9obssmq3mbc.png" alt="Image description" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Storage tiering
&lt;/h2&gt;

&lt;p&gt;Memphis offers a multi-tier storage strategy in its open-source version. Memphis writes messages that have reached the end of their first retention policy to a second retention tier on object storage such as S3, enabling longer (potentially infinite) retention and post-streaming analysis. This feature can significantly help with cost reduction and stream auditing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permanent storage
&lt;/h2&gt;

&lt;p&gt;Both Kafka and Memphis store data durably and reliably, much like a normal database. Data retention is user configurable per Memphis station or Kafka topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotency
&lt;/h2&gt;

&lt;p&gt;Both Kafka and Memphis provide default support for idempotent producers.&lt;br&gt;
On the consumer side, in Kafka it's the client's responsibility to build a retry mechanism that retransmits a batch of messages exactly once, while in Memphis it is provided natively within the SDK via a parameter called &lt;code&gt;maxMsgDeliveries&lt;/code&gt;.&lt;/p&gt;
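&lt;p&gt;The consumer-side difference can be sketched with a toy redelivery counter. This is illustrative only, not the Memphis SDK; it models the bookkeeping that a &lt;code&gt;maxMsgDeliveries&lt;/code&gt;-style cap performs for you: retry a failing message up to the cap, then stop redelivering it.&lt;/p&gt;

```python
def deliver_with_cap(process, msg, max_msg_deliveries):
    """Toy model of capped redelivery; not a real SDK call.
    Returns the attempt number on success, or None once the cap is hit."""
    for attempt in range(1, max_msg_deliveries + 1):
        try:
            process(msg)
            return attempt  # acknowledged on success
        except Exception:
            continue  # a broker would redeliver the message here
    return None  # candidate for a dead-letter station

# A handler that fails transiently and succeeds on its third attempt.
attempts = {"count": 0}
def flaky(msg):
    attempts["count"] += 1
    if attempts["count"] != 3:
        raise RuntimeError("transient failure")

print(deliver_with_cap(flaky, "hello", 5))  # 3
```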

&lt;h2&gt;
  
  
  Geo-Replication (Multi-region)
&lt;/h2&gt;

&lt;p&gt;Common scenarios for a geo-replication include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Geo-replication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disaster recovery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feeding edge clusters into a central, aggregate cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Physical isolation of clusters (such as production vs. testing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud migration or hybrid cloud deployments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Legal and compliance requirements&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka users can set up such inter-cluster data flows with Kafka's MirrorMaker (version 2), a tool to replicate data between different Kafka environments in a streaming manner.&lt;/p&gt;

&lt;p&gt;Memphis cloud users can create more Memphis clusters and form a supercluster that replicates data in an async manner between the clusters of streamed data, security, consumer groups, unified management, and more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dKGfcUwK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hk5w97s22x82yh9d0e44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dKGfcUwK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hk5w97s22x82yh9d0e44.png" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GUI
&lt;/h2&gt;

&lt;p&gt;Multiple open-source GUIs have been developed for Kafka over the years, for example &lt;a href="https://github.com/provectus/kafka-ui"&gt;Kafka-UI&lt;/a&gt;. These typically cannot sustain heavy traffic or visualization loads and require separate compute and maintenance. Various commercial Kafka distributions, such as Confluent and Conduktor, provide a robust GUI.&lt;/p&gt;

&lt;p&gt;Memphis provides a native state-of-the-art GUI, hosted inside the broker, built to act as a management layer of all Memphis aspects, including cluster config, resources, data observability, notifications, processing, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MOXz1OcE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6n5t9fkmut3zmrsbrjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MOXz1OcE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6n5t9fkmut3zmrsbrjc.png" alt="Image description" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead-letter Queue
&lt;/h2&gt;

&lt;p&gt;A dead-letter queue is both a concept and a solution that is useful for debugging clients: it lets you isolate and "recycle" unconsumed messages instead of dropping them, so you can determine why their processing failed.&lt;/p&gt;

&lt;p&gt;The Kafka architecture does not support DLQ within the broker; it is the client or consumer's responsibility to implement such behavior, for better or worse.&lt;/p&gt;

&lt;p&gt;One of Memphis' core building blocks is avoiding unexpected data loss, enabling rapid development, and shortening troubleshooting cycles. Therefore, Memphis provides a native solution for dead-letter that acts as the station recycle bin for various failures such as unacknowledged messages, schema violations, and custom exceptions.&lt;/p&gt;
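&lt;p&gt;The client-side pattern a Kafka consumer has to implement looks roughly like this hedged sketch (a toy in-memory version, not a real Kafka client): messages whose processing fails are routed to a dead-letter collection, with the failure reason attached, instead of being dropped.&lt;/p&gt;

```python
import json

def consume_with_dlq(messages, process, dead_letters):
    """Route messages whose processing fails to a dead-letter list
    instead of dropping them (toy sketch of the client-side DLQ pattern)."""
    processed = []
    for msg in messages:
        try:
            processed.append(process(msg))
        except Exception as e:
            # Keep the failure reason alongside the message for debugging.
            dead_letters.append({"message": msg, "error": str(e)})
    return processed

dlq = []
ok = consume_with_dlq(['{"a": 1}', "not json", '{"b": 2}'], json.loads, dlq)
print(len(ok), len(dlq))  # 2 1
```

In a real Kafka deployment, `dead_letters.append(...)` would be a produce call to a dedicated dead-letter topic; in Memphis the broker performs this routing for you.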

&lt;h2&gt;
  
  
  Schema Management
&lt;/h2&gt;

&lt;p&gt;The very basic building block to control and ensure the quality of data that flows through your organization between the different owners is by defining well-written schemas and data models.&lt;/p&gt;

&lt;p&gt;Confluent offers "Schema Registry" which is a standalone component and provides a centralized repository for schemas and metadata, allowing services to flexibly interact and exchange data with each other without the challenge of managing and sharing schemas between them. It requires dedicated management, maintenance, scale, and monitoring.&lt;/p&gt;

&lt;p&gt;As part of its open-source version, Memphis offers Schemaverse, which is also embedded within the broker. Schemaverse provides a robust schema store and schema management layer on top of the Memphis broker without standalone compute or dedicated resources. With a unique, modern UI and a programmatic approach, technical and non-technical users can create and define different schemas, attach a schema to multiple stations, and choose whether the schema should be enforced. In contrast to Schema Registry, the client does not need to implement serialization functions, and every schema update takes effect during the producers' runtime.&lt;/p&gt;
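&lt;p&gt;As a rough illustration of enforcement (not Schemaverse's actual syntax or validation engine), a broker-side check might validate each produced message against the attached schema and report violations; the schema shape and function name below are assumptions:&lt;/p&gt;

```python
# Hypothetical sketch of broker-side schema enforcement: messages that violate
# the attached schema can be rejected or routed to the dead-letter station.
SCHEMA = {"id": int, "email": str}  # illustrative schema, not Schemaverse syntax

def violations(message, schema=SCHEMA):
    """Return a list of schema violations for a message (empty when valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected):
            errors.append(f"bad type for {field}")
    return errors
```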

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bTRPJibm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7yaibcqkx2gqv41cr55f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bTRPJibm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7yaibcqkx2gqv41cr55f.png" alt="Image description" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Message routing
&lt;/h2&gt;

&lt;p&gt;Kafka provides routing capabilities through Kafka Connect and Kafka Streams, including content-based routing, message transformation, and message enrichment.&lt;/p&gt;

&lt;p&gt;Memphis message routing is similar to the implementation of RabbitMQ using routing keys, wildcards, content-based routing, and more. Similar to RabbitMQ, it is also embedded within the broker and does not require external libraries or tools.&lt;/p&gt;
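&lt;p&gt;The routing-key semantics mentioned above can be sketched in a few lines of Python; this follows the RabbitMQ convention in which &lt;code&gt;*&lt;/code&gt; matches exactly one dot-separated word and &lt;code&gt;#&lt;/code&gt; matches zero or more (the function name is ours):&lt;/p&gt;

```python
def matches(pattern, key):
    """RabbitMQ-style topic match: '*' binds one word, '#' binds zero or more."""
    def walk(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb zero words, or one more word and try again
            return walk(p[1:], k) or bool(k) and walk(p, k[1:])
        if k and p[0] in ("*", k[0]):
            return walk(p[1:], k[1:])
        return False
    return walk(pattern.split("."), key.split("."))
```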

&lt;h2&gt;
  
  
  Log compaction
&lt;/h2&gt;

&lt;p&gt;Compaction was created to support a long-term, potentially infinite record store based on specific keys.&lt;/p&gt;

&lt;p&gt;Kafka supports native topic compaction, which runs on all brokers. This runs automatically for compacted topics, condensing the log down to the latest version of messages sharing the same key.&lt;/p&gt;
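&lt;p&gt;Conceptually, compaction reduces a keyed log to the latest value per key. A toy in-memory Python model (assumed names, not broker code) captures the idea:&lt;/p&gt;

```python
def compact(log):
    """Keep only the latest record per key, a simplified view of topic
    compaction. `log` is a list of (key, value) pairs in offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    # Surviving (key, value) pairs, ordered by their final offset
    return [(k, v) for k, (o, v) in sorted(latest.items(), key=lambda kv: kv[1][0])]
```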

&lt;p&gt;At the moment, Memphis does not support compaction, but it will in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message replay
&lt;/h2&gt;

&lt;p&gt;The ability to re-consume committed messages.&lt;/p&gt;

&lt;p&gt;Kafka does support replay by seeking specific offsets as the consumers have control over resetting the offset.&lt;/p&gt;
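&lt;p&gt;The mechanics can be modeled with a toy consumer that tracks its own offset; seeking backwards is all that replay requires of the log (names and structure are our illustration, not the Kafka client API):&lt;/p&gt;

```python
class OffsetConsumer:
    """Minimal sketch of offset-based consumption with seek-driven replay."""
    def __init__(self, log):
        self.log = log        # the committed, append-only message log
        self.position = 0     # next offset to read

    def poll(self, n=1):
        batch = self.log[self.position:self.position + n]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        # Resetting the consumer's offset is what enables re-consumption
        self.position = offset
```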

&lt;p&gt;Memphis does not support replay yet but will in the near future (2023).&lt;/p&gt;

&lt;h2&gt;
  
  
  Stream Enrichment
&lt;/h2&gt;

&lt;p&gt;Kafka, with its Kafka Streams library, allows developers to implement elastic and scalable client applications that can leverage essential stream processing features such as tables, joins, and aggregations of several topics, and export to multiple sources via Kafka connect.&lt;/p&gt;

&lt;p&gt;Memphis provides similar behavior and more. Embedded inside the broker, Memphis users can create serverless-style functions or complete containerized applications that aggregate several stations and streams, decorate and enrich messages from different sources, write complex functions that cannot be achieved via SQL, and manipulate the schema. Memphis' embedded connectors framework will help push the results directly to a defined sink.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull retry mechanism
&lt;/h2&gt;

&lt;p&gt;In case of a failure or an inability to acknowledge consumed messages, there should be a retry mechanism that re-pulls the same offset or batch of messages.&lt;/p&gt;

&lt;p&gt;In Kafka, it is the client's responsibility to implement one. Some key factors must be considered when implementing such a mechanism, such as blocking vs. non-blocking behavior, offset tracking, idempotency, and more.&lt;/p&gt;

&lt;p&gt;In Memphis, the retry mechanism is built-in and enabled by default within the SDK and broker. During consumer creation, the &lt;code&gt;maxMsgDeliveries&lt;/code&gt; parameter determines how many times the station will redeliver a message if an acknowledgment does not arrive within &lt;code&gt;maxAckTimeMs&lt;/code&gt;. The broker itself records the offsets given and exposes only the unacknowledged ones to the retry request.&lt;/p&gt;
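&lt;p&gt;A small simulation of this redelivery loop (our sketch, not the broker's actual code) shows how the two parameters interact:&lt;/p&gt;

```python
# Toy model of Memphis' built-in redelivery: each delivery attempt either
# receives an ack within maxAckTimeMs (True) or times out (False); after
# max_msg_deliveries unacked attempts, the message lands in the dead-letter.
def deliver(acks, max_msg_deliveries=3):
    """acks[i] is True when delivery attempt i+1 was acknowledged in time.
    Returns (attempts_used, final_state)."""
    outcomes = list(acks)[:max_msg_deliveries]
    # a missing outcome means no ack arrived within maxAckTimeMs
    outcomes += [False] * (max_msg_deliveries - len(outcomes))
    for attempt, acked in enumerate(outcomes, start=1):
        if acked:
            return attempt, "acked"
    return max_msg_deliveries, "dead-letter"
```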

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a robust and mature system with extensive community support and a proven record for high-throughput use cases and users. Still, it requires micro-management: troubleshooting can consume precious time, and complex client implementations, wrappers, and frequent tunings pull the user's focus and resources away from the "main event." Most importantly, it does not scale well operationally as the organization grows and more use cases join in. Wix's &lt;a href="https://github.com/wix/greyhound"&gt;Greyhound&lt;/a&gt; library is excellent proof of the work needed on top of Kafka.&lt;/p&gt;

&lt;p&gt;We call Memphis "a next-generation message broker" because it leans toward the user and adapts to the user's scale and requirements, not the opposite. Most of the wrappers, tunings, management overhead, and client-side implementations that Kafka requires are abstracted away from users in Memphis, which provides an excellent solution for both smaller workloads and more demanding ones under the same system, with a full ecosystem to support it. It still has mileage to cover, but the immediate benefits already exist and will continue to evolve.&lt;/p&gt;




&lt;p&gt;Sources&lt;br&gt;
&lt;a href="https://kafka.apache.org/documentation"&gt;https://kafka.apache.org/documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.confluent.io"&gt;https://www.confluent.io&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.kai-waehner.de/blog"&gt;https://www.kai-waehner.de/blog&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.memphis.dev"&gt;https://docs.memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Sun, 28 May 2023 09:41:43 +0000</pubDate>
      <link>https://forem.com/memphis_dev/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphisdev-2b0c</link>
      <guid>https://forem.com/memphis_dev/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphisdev-2b0c</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part two of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;last blog post&lt;/a&gt;, we introduced a reference implementation for capturing change data capture (CDC) events from a PostgreSQL database using Debezium Server and Memphis.dev. By replacing Apache Kafka with Memphis.dev, the solution substantially reduced the operational resources and overhead – saving money and freeing developers to focus on building new functionality.&lt;/p&gt;

&lt;p&gt;PostgreSQL is not the only commonly used database, however. Debezium provides connectors for a range of databases, including the non-relational document database MongoDB. MongoDB is popular with developers, especially those working in dynamic programming languages, since it avoids the object-relational impedance mismatch: developers can directly store, query, and update objects in the database.&lt;/p&gt;

&lt;p&gt;In this blog post, we demonstrate how to adapt the CDC solution to MongoDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;Here, we describe the architecture of the reference solution for delivering change data capture events with &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt;. The architecture has not changed from &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;our previous blog&lt;/a&gt; post except for the replacement of PostgreSQL with MongoDB.&lt;/p&gt;

&lt;p&gt;A Todo Item generator script writes randomly-generated records to MongoDB. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; receives CDC events from MongoDB and forwards them to the &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis REST gateway&lt;/a&gt; through the HTTP client sink. The Memphis REST gateway adds the messages to a station in Memphis.dev. Lastly, a consumer script polls Memphis.dev for new messages and prints them to the console.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds. Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console&lt;/li&gt;
&lt;/ol&gt;
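&lt;p&gt;For orientation, a todo item like the ones the generator produces might be built as follows; the field names mirror the description above and the sample payload later in this post, but the exact shape used by the real generator script may differ:&lt;/p&gt;

```python
import random
import string
from datetime import datetime, timedelta

def make_todo_item():
    """Sketch of a randomly generated todo item: description, creation
    timestamp, optional due date, and completion status."""
    now = datetime.utcnow()
    return {
        # sample items in this post use 20-character uppercase descriptions
        "description": "".join(random.choices(string.ascii_uppercase, k=20)),
        "creation_timestamp": now,
        # the due date is optional: present roughly half the time
        "due_date": now + timedelta(days=3) if random.choice([True, False]) else None,
        "completed": False,
    }
```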

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JF-DxWRl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://community.ops.io/remoteimages/uploads/articles/5q7l2c9nphoanqd990wl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JF-DxWRl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://community.ops.io/remoteimages/uploads/articles/5q7l2c9nphoanqd990wl.jpg" alt="Image description" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The implementation tutorial is available in the mongodb-debezium-cdc-example directory of the &lt;a href="https://github.com/memphisdev/memphis-example-solutions"&gt;Memphis Example Solutions&lt;/a&gt; repository. &lt;a href="https://docs.docker.com/compose/"&gt;Docker Compose&lt;/a&gt; will be needed to run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the Implementation&lt;/strong&gt;&lt;br&gt;
Build the Docker images for Debezium Server, the printing consumer, and database setup (table and user creation).&lt;/p&gt;

&lt;p&gt;Currently, the implementation depends on a pre-release version of &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; for the JWT authentication support. A Docker image will be built directly from the main branch of the Debezium and Debezium Server repositories. Note that this step can take quite a while (~20 minutes) to run. When Debezium Server 2.3.0 is released, we will switch to using the upstream Docker image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Build the Images&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose build --pull --no-cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Start the Memphis.dev Broker and REST Gateway&lt;/strong&gt;&lt;br&gt;
Start the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt; and &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;REST gateway&lt;/a&gt;. Note that the memphis-rest-gateway service depends on the memphis broker service, so the broker service will be started as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d memphis-rest-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 4/4
 ⠿ Network mongodb-debezium-cdc-example_default                   Created                                                        0.0s
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1      Healthy                                                        6.0s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1               Healthy                                                       16.8s
 ⠿ Container mongodb-debezium-cdc-example-memphis-rest-gateway-1  Started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Create a Station and Corresponding User in Memphis.dev&lt;/strong&gt;&lt;br&gt;
Messages are delivered to “stations” in Memphis.dev; they are equivalent to the “topics” used by other message brokers. Point your browser at &lt;a href="http://localhost:9000/"&gt;http://localhost:9000/&lt;/a&gt;. Click the “sign in with root” link at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aGlt7JeN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8b66yq3mhlhcpl3pysk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aGlt7JeN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8b66yq3mhlhcpl3pysk0.png" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with root (username) and memphis (password).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zgbnttVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3w30jqqw9j2aqepdjo0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zgbnttVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3w30jqqw9j2aqepdjo0w.png" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the wizard to create a station named todo-cdc-events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--286s0lpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3cr59jdm81v59vhj0q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--286s0lpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n3cr59jdm81v59vhj0q2.png" alt="Image description" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a user named todocdcservice with the same value for the password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XppmPBAC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wuf7ycgdh53c7spmqzv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XppmPBAC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wuf7ycgdh53c7spmqzv.png" alt="Image description" width="768" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “next” until the wizard is finished:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yY14Lm9E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diary1gttdprud5aqhz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yY14Lm9E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diary1gttdprud5aqhz4.png" alt="Image description" width="768" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “Go to station overview” to go to the station overview page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dzNXCeIz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uz46k1g28teilq2ldo5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dzNXCeIz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uz46k1g28teilq2ldo5a.png" alt="Image description" width="768" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Start the Printing Consumer&lt;/strong&gt;&lt;br&gt;
We used the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt; to create a &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/printing-consumer/test_consumer.py"&gt;consumer script&lt;/a&gt; that polls the todo-cdc-events station and prints the messages to the console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container printing-consumer                                Started                                                            1.4s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5: Starting and Configuring MongoDB&lt;/strong&gt;&lt;br&gt;
To capture changes, &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt;’s &lt;a href="https://www.mongodb.com/docs/manual/replication/"&gt;replication&lt;/a&gt; functionality must be enabled. There are several steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The replica set name must be set. This can be done by &lt;a href="https://www.mongodb.com/docs/manual/tutorial/deploy-replica-set-for-testing/#std-label-server-replica-set-deploy-test"&gt;passing the name of a replica set&lt;/a&gt; on the command-line or in the configuration file. In the &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/docker-compose.yaml#L10"&gt;Docker Compose file&lt;/a&gt;, we run MongoDB with the command-line argument &lt;code&gt;--replSet rs0&lt;/code&gt; to set the replica set name.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When replication is used and authorization is enabled, a common key file must be provided to each replica instance. We generated a key file following the &lt;a href="https://www.mongodb.com/docs/manual/tutorial/enforce-keyfile-access-control-in-existing-replica-set/"&gt;instructions&lt;/a&gt; in the MongoDB documentation. We then &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/mongodb/Dockerfile"&gt;built an image&lt;/a&gt; that extends the official MongoDB image by including the key file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The replica set needs to be initialized once MongoDB is running. We &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/database-setup/database_setup.py"&gt;use a script&lt;/a&gt; that configures the instance on startup. The script calls the &lt;a href="https://www.mongodb.com/docs/manual/reference/command/replSetInitiate/"&gt;replSetInitiate&lt;/a&gt; command with a list of the IP addresses and ports of each MongoDB instance in the replica set. This command causes the MongoDB instances to communicate with each other and select a leader.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generally speaking, replica sets are used for increased reliability (high availability). Most documentation that you’ll find describes how to set up a replica set with multiple MongoDB instances. In our case, Debezium’s MongoDB connector piggybacks off of the replication functionality to capture data change events. Although we go through the steps to configure a replica set, we only use one MongoDB instance.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/todo-generator/todo_generator.py"&gt;todo item generator script&lt;/a&gt; creates a new todo item every half second. The field values are randomly generated. The items are added to a MongoDB collection named “todo_items.”&lt;/p&gt;

&lt;p&gt;In the Docker Compose file, the todo item generator script is configured to depend on the MongoDB instance being healthy and on successful completion of the database setup script. By starting the todo item generator script, Docker Compose will also start MongoDB and run the database setup script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d todo-generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container mongodb                 Healthy                                                                                     8.4s
 ⠿ Container mongodb-database-setup  Exited                                                                                      8.8s
 ⠿ Container mongodb-todo-generator  Started                                                                                     9.1s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Start the Debezium Server&lt;/strong&gt;&lt;br&gt;
The last service that needs to be started is the &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt;. The server is configured with a source connector for MongoDB and the HTTP Client sink connector through a &lt;a href="https://en.wikipedia.org/wiki/.properties"&gt;Java properties file&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debezium.sink.type=http
debezium.sink.http.url=http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single
debezium.sink.http.time-out.ms=500
debezium.sink.http.retries=3
debezium.sink.http.authentication.type=jwt
debezium.sink.http.authentication.jwt.username=todocdcservice
debezium.sink.http.authentication.jwt.password=todocdcservice
debezium.sink.http.authentication.jwt.url=http://memphis-rest-gateway:4444/
debezium.source.connector.class=io.debezium.connector.mongodb.MongoDbConnector
debezium.source.mongodb.connection.string=mongodb://db
debezium.source.mongodb.user=root
debezium.source.mongodb.password=mongodb
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.topic.prefix=tutorial
debezium.format.key=json
debezium.format.value=json
quarkus.log.console.json=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the options are self-explanatory. The HTTP client sink URL is worth explaining in detail. &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis.dev REST gateway&lt;/a&gt; expects to receive POST requests with a path in the following format:&lt;br&gt;
&lt;em&gt;/stations/{station}/produce/{quantity}&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The {station} placeholder is replaced with the name of the station to send the message to. The {quantity} placeholder is replaced with the value single (for a single message) or batch (for multiple messages).&lt;/p&gt;

&lt;p&gt;Messages are passed as the payload of the POST request. The REST gateway supports three message formats (plain text, JSON, or protocol buffers). The value of the content-type header field (text/, application/json, application/x-protobuf) determines how the payload is interpreted.&lt;/p&gt;

&lt;p&gt;The Debezium Server’s HTTP Client sink produces REST requests that are consistent with these patterns: requests use the POST verb, each request contains a single JSON-encoded message as the payload, and the content-type header is set to application/json. We use todo-cdc-events as the station name and the single quantity value in the endpoint URL to route messages and indicate how the REST gateway should interpret the requests:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single"&gt;http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The debezium.sink.http.authentication.type=jwt property indicates that the HTTP Client sink should use JWT authentication. The username and password properties are self-evident, but the debezium.sink.http.authentication.jwt.url property deserves some explanation. An initial token is acquired using the /auth/authenticate endpoint, while the authentication is refreshed using the separate /auth/refreshToken endpoint. The JWT authentication in the HTTP Client appends the appropriate endpoint to the given base URL.&lt;/p&gt;

&lt;p&gt;Debezium Server can be started with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d debezium-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7: Confirm the System is Working&lt;/strong&gt;&lt;br&gt;
Check the todo-cdc-events station overview screen in Memphis.dev web UI to confirm that the producer and consumer are connected and messages are being delivered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dmeY_0lF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f4cwtk8dn36ahwne2i0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dmeY_0lF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f4cwtk8dn36ahwne2i0u.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;And, print the logs for the printing-consumer container:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bytearray(b'{"schema":{"type":"struct","fields":[{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"before"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"after"},{"type":"struct","fields":[{"type":"array","items":{"type":"string","optional":false},"optional":true,"field":"removedFields"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"updatedFields"},{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":false,"field":"field"},{"type":"int32","optional":false,"field":"size"}],"optional":false,"name":"io.debezium.connector.mongodb.changestream.truncatedarray","version":1},"optional":true,"field":"truncatedArrays"}],"optional":true,"name":"io.debezium.connector.mongodb.changestream.updatedescription","version":1,"field":"updateDescription"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"rs"},{"type":"string","optional":false,"field":"collection"},{"type":"int32","optional":false,"field":"ord"},{"type":"string","optional":true,"field":"lsid"},{"type":"int64","optional":true,"field":"txnNumber"},{"type":"int64","optional":true,"field":"wallTime"}],"optional":false,"name":"io.debezium.connector.mongo.Source","field":"source"},{"type":"string","optional":true,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"t
ype":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.todo_application.todo_items.Envelope"},"payload":{"before":null,"after":"{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ec\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402475},\\"due_date\\": {\\"$date\\": 1684266602475},\\"description\\": \\"GMZVMKXVKOWIOEAVRYWR\\",\\"completed\\": false}","updateDescription":null,"source":{"version":"2.3.0-SNAPSHOT","connector":"mongodb","name":"tutorial","ts_ms":1684007402000,"snapshot":"false","db":"todo_application","sequence":null,"rs":"rs0","collection":"todo_items","ord":1,"lsid":null,"txnNumber":null,"wallTime":1684007402476},"op":"c","ts_ms":1684007402478,"transaction":null}}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Format of the CDC Messages
&lt;/h2&gt;

&lt;p&gt;The incoming messages are formatted as JSON. The messages have two top-level fields (schema and payload). The schema describes the record schema (field names and types), while the payload describes the change to the record. The payload object itself contains two fields (before and after) indicating the value of the record before and after the change.&lt;/p&gt;

&lt;p&gt;For MongoDB, Debezium Server encodes the record as a string of serialized JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before" : null,

"after" : "{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ed\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402978},\\"due_date\\": {\\"$date\\": 1684266602978},\\"description\\": \\"buy milk\\",\\"completed\\": false}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will have implications on the downstream processing of messages, which we will describe in a future blog post in this series.&lt;/p&gt;
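&lt;p&gt;In practice, this means a consumer must decode JSON twice. A short Python sketch (the function name is ours) recovers the MongoDB document from a CDC message:&lt;/p&gt;

```python
import json

def extract_after(cdc_message):
    """Decode a Debezium MongoDB CDC message: the envelope is JSON, and the
    payload's "after" field is itself a string of serialized JSON."""
    envelope = json.loads(cdc_message)
    after = envelope["payload"]["after"]
    return json.loads(after) if after is not None else None
```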




&lt;p&gt;Congratulations! You now have a working example of how to capture data change events from a MongoDB database using Debezium Server and transfer the events to Memphis.dev for downstream processing.&lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Part 3: Transforming MongoDB CDC Event Messages&lt;/a&gt; to learn further.&lt;/p&gt;

&lt;p&gt;In case you missed part 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>memphisdev</category>
      <category>cdc</category>
    </item>
    <item>
      <title>How to reduce your data traffic by 30% instantly</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Wed, 24 May 2023 14:27:41 +0000</pubDate>
      <link>https://forem.com/memphis_dev/how-to-reduce-your-data-traffic-by-30-instantly-4goa</link>
      <guid>https://forem.com/memphis_dev/how-to-reduce-your-data-traffic-by-30-instantly-4goa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The bigger the traffic, the higher the latency and the higher the cost.&lt;br&gt;
The global economic situation has grounded us a bit and sent us back to basics, to the days when every byte of memory counted and every unnecessary line of code was removed. Besides latency, which usually drives us to pour in more compute and more money, we tend to forget that we are also paying a &lt;strong&gt;huge amount&lt;/strong&gt; for the volume of transferred data, especially in communication between services.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Let’s take a look at the following scenario&lt;/strong&gt; -&lt;br&gt;
A single, shared EC2 instance with 2 CPUs and 4GB RAM,&lt;br&gt;
processing 10,000 requests of 1MB per day -&amp;gt; 300,000 requests in an average month -&amp;gt; roughly 292 GB of transferred data.&lt;/p&gt;

&lt;p&gt;If we run those numbers through an AWS EC2 calculator, we would get the following invoice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute monthly cost: $33.29, plus&lt;/li&gt;
&lt;li&gt;Data transfer cost:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If those 292 GB are transferred within a region, it will cost &lt;strong&gt;$5.84 (14% of the total invoice)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If those 292 GB are transferred out to the internet, it will cost &lt;strong&gt;$26.28 (44% of the total invoice)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
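As a sanity check on the invoice above, a few lines of arithmetic reproduce those figures (assuming roughly 1 MB per request, which is what yields about 292 GB per month, and illustrative per-GB rates of about $0.02 in-region and $0.09 to the internet that match the invoice):

```python
# Monthly transfer volume: 10,000 requests/day at ~1 MB each.
requests_per_day = 10_000
mb_per_request = 1
gb_per_month = requests_per_day * mb_per_request * 30 / 1024
print(round(gb_per_month))                  # ~293 GB

# Illustrative AWS-style data transfer rates (USD per GB).
in_region = round(292 * 0.02, 2)            # 5.84
to_internet = round(292 * 0.09, 2)          # 26.28
print(in_region, to_internet)
```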


&lt;h2&gt;
  
  
  Popular formats
&lt;/h2&gt;

&lt;p&gt;Usually, services communicate with each other using one of the following formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- JSON.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JSON stands for JavaScript Object Notation and is a text format for storing and transporting data.&lt;/p&gt;

&lt;p&gt;When using JSON to send a message, we’ll first have to serialize the object representing the message into a JSON-formatted string, then transmit.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"sensorId": 32,"sensorValue": 24.5}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
This string is 36 characters long, but the information content of the string is only 6 characters long. This means that about 16% of transmitted data is actual data, while the rest is metadata. The ratio of useful data in the whole message is increased by decreasing key length or increasing value size, for example, when using a string or array.&lt;/p&gt;
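The character-count claim above can be checked directly; this little sketch computes the share of actual data in the message:

```python
# The example message, exactly as above.
message = '{"sensorId": 32,"sensorValue": 24.5}'

# Only the values themselves carry information; keys and syntax are overhead.
data_chars = len("32") + len("24.5")
ratio = data_chars / len(message)
print(len(message), data_chars, f"{ratio:.1%}")   # 36 6 16.7%
```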

&lt;p&gt;&lt;strong&gt;- Protobuf.&lt;/strong&gt;&lt;br&gt;
Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.&lt;/p&gt;

&lt;p&gt;Protocol Buffers use a binary format to transfer messages.&lt;br&gt;
Using Protocol Buffers in your code is slightly more complicated than using JSON.&lt;/p&gt;

&lt;p&gt;The user must first define a message using the .proto file. This file is then compiled using Google’s protoc compiler, which generates source files that contain the Protocol Buffer implementation for the defined messages.&lt;/p&gt;

&lt;p&gt;This is how our message would look in the .proto definition file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message TemperatureSensorReading {
    optional uint32 sensor_id = 1;
    optional float sensor_value = 2;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When serializing the message from the example above, it’s only 7 bytes long. This can be confusing initially because we would expect uint32 and float to be 8 bytes long when combined. However, Protocol Buffers won’t use all 4 bytes for uint32 if they can encode the data in fewer bytes. In this example, the sensor_id value can be stored in 1 byte. It means that in this serialized message, 1 byte is metadata for the first field, and the field data itself is only 1 byte long. The remaining 5 bytes are metadata and data for the second field; 1 byte for metadata and 4 bytes for data because float always uses 4 bytes in Protocol Buffers. This gives us 5 bytes or 71% of actual data in a 7-byte message.&lt;/p&gt;
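To see where those 7 bytes come from, here is a hand-rolled sketch of the Protocol Buffers wire encoding for this message (field tags and varints follow the protobuf encoding spec; this is not something you would write in practice, since the generated code does it for you):

```python
def encode_varint(value):
    # Protobuf varints: 7 bits per byte, high bit set on all but the last byte.
    out = bytearray()
    while True:
        low = value % 128
        value //= 128
        if value:
            out.append(low + 128)
        else:
            out.append(low)
            return bytes(out)

tag_sensor_id = bytes([0x08])      # field 1, wire type 0 (varint)
tag_sensor_value = bytes([0x15])   # field 2, wire type 5 (32-bit)
value_32 = encode_varint(32)                     # fits in a single byte
value_24_5 = (0x41C40000).to_bytes(4, "little")  # 24.5 as IEEE-754 float32

encoded = tag_sensor_id + value_32 + tag_sensor_value + value_24_5
print(len(encoded))   # 7
```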

&lt;p&gt;The main difference between the two is that JSON is just text, while Protocol Buffers are binary. This difference has a crucial impact on the size and performance of moving data between different devices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;p&gt;In this benchmark, we will take the same message structure and examine the size difference, as well as the network performance.&lt;/p&gt;

&lt;p&gt;We used Memphis Schemaverse as the destination and a simple Python app as the sender.&lt;/p&gt;

&lt;p&gt;Gaming industry use cases are usually considered to have large payloads, so to demonstrate the massive savings between the different formats, we used one of Blizzard’s example schemas.&lt;br&gt;
The full .proto we used can be found &lt;a href="https://github.com/Blizzard/s2client-proto/blob/01ab351e21c786648e4c6693d4aad023a176d45c/s2clientprotocol/sc2api.proto#L359"&gt;here&lt;/a&gt;.&lt;br&gt;
Each packet weighs &lt;strong&gt;959.55KiB&lt;/strong&gt; on average.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y2a4y3I0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7213jh44mlqda6g18pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y2a4y3I0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7213jh44mlqda6g18pw.png" alt="Image description" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://levelup.gitconnected.com/protobuf-a-high-performance-data-interchange-format-64eaf7c82c0d"&gt;source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen, the average savings range from 618.19% to 807.93%!&lt;/p&gt;

&lt;p&gt;Another key aspect to take into consideration is the additional step of &lt;strong&gt;serialization/deserialization&lt;/strong&gt; and its potential impact on performance, as it is, in fact, an action similar to compression.&lt;/p&gt;

&lt;p&gt;Quick side note: Memphis Schemaverse eliminates the need to implement serialization/deserialization functions, but behind the scenes, a set of conversion functions still takes place.&lt;/p&gt;

&lt;p&gt;Going back to serialization: the tests were performed on a MacBook Pro (M1 Pro, 16 GB RAM) using Google.Protobuf.JsonFormatter and Google.Protobuf.JsonParser; Python serialization/deserialization used google.protobuf.json_format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BoEfqO6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kn0majq2e8dx5qhbqjcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BoEfqO6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kn0majq2e8dx5qhbqjcv.png" alt="Tables" width="795" height="503"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This comparison is not about declaring which format is better. Both have their own strengths, but if we start from the end and ask ourselves which parameters matter most for our use case, and both low latency and a small footprint are among them, then protobuf would be a suitable choice.&lt;br&gt;
If the added complexity is a drawback, I highly recommend checking out &lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management#meet-schemaverse"&gt;Schemaverse&lt;/a&gt; and how it eliminates most of the heavy lifting when JSON is your current format but the benefits of protobuf are appealing.&lt;/p&gt;




&lt;p&gt;Resources&lt;br&gt;
&lt;a href="https://infinum.com/blog/json-vs-protocol-buffers/"&gt;https://infinum.com/blog/json-vs-protocol-buffers/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://levelup.gitconnected.com/protobuf-a-high-performance-data-interchange-format-64eaf7c82c0d"&gt;https://levelup.gitconnected.com/protobuf-a-high-performance-data-interchange-format-64eaf7c82c0d&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/"&gt;memphis.dev&lt;/a&gt; by &lt;a href="https://twitter.com/memphisveta"&gt;Sveta Gimpelson&lt;/a&gt; Co-founder &amp;amp; VP of Data &amp;amp; Research at Memphis.dev.&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Thu, 18 May 2023 08:17:28 +0000</pubDate>
      <link>https://forem.com/memphis_dev/part-1-integrating-debezium-server-and-memphisdev-for-streaming-change-data-capture-cdc-events-52no</link>
      <guid>https://forem.com/memphis_dev/part-1-integrating-debezium-server-and-memphisdev-for-streaming-change-data-capture-cdc-events-52no</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part one of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Change_data_capture"&gt;Change data capture (CDC)&lt;/a&gt; is an architectural pattern which turns databases into sources for event-driven architectures. Frequently, CDC is implemented on top of built-in replication support. Changes to data (e.g., caused by INSERT, UPDATE, and DELETE statements) are recorded as atomic units and appended to a replication log for transmission to replicas. CDC software copies the events from the replica logs to streaming infrastructure for processing by downstream components.&lt;/p&gt;

&lt;p&gt;So what do CDC events look like? In this tutorial, we’ll use the example of a table of todo items with the following fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HNrSCR92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8e6o01tdk83o8r85e336.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HNrSCR92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8e6o01tdk83o8r85e336.png" alt="Table" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A null value for the due date signifies that there is no due date.&lt;/p&gt;

&lt;p&gt;If a user creates a todo item to buy milk from the store, the corresponding CDC event would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before": null,

"after": {
    "id": 25,
    "description": "buy milk",
    "creation_timestamp": "2023-05-01T16:32:15",
    "due_date": "2023-05-02",
    "completed": false }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user then completes (updates) the todo item, the following CDC event would be generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before": {
    "id": 25,
    "description": "buy milk",
    "creation_timestamp": "2023-05-01T16:32:15",
    "due_date": "2023-05-02",
    "completed": false },

"after": {
    "id": 25,
    "description": "buy milk",
    "creation_timestamp": "2023-05-01T16:32:15",
    "due_date": "2023-05-02",
    "completed": true }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user then deletes the item, the CDC event will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before": {
     "id": 25,
     "description": "buy milk",
     "creation_timestamp": "2023-05-01T16:32:15",
     "due_date": "2023-05-02",
     "completed": false },

"after": null
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
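The three event shapes above can be told apart purely by which of the before/after fields is null, which is how downstream consumers typically classify them:

```python
def classify(event):
    # Derive the operation type from the before/after fields of a CDC event.
    if event["before"] is None:
        return "insert"
    if event["after"] is None:
        return "delete"
    return "update"

item = {"id": 25, "description": "buy milk", "completed": False}
print(classify({"before": None, "after": item}))                        # insert
print(classify({"before": item, "after": dict(item, completed=True)}))  # update
print(classify({"before": item, "after": None}))                        # delete
```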



&lt;p&gt;The CDC approach is used to support various data analyses that are run in near real-time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Copying data from one database to another&lt;/strong&gt;. Modern systems often incorporate multiple storage solutions chosen because of their optimizations for complementary workloads. For example, online transaction processing (OLTP) databases like &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; are designed to support many concurrent users each performing queries that touch a small amount of data. Online analytical processing (OLAP) databases like &lt;a href="https://clickhouse.com/"&gt;Clickhouse&lt;/a&gt; are optimized to handle a small number of queries touching a large amount of data. The CDC approach doesn’t require schema changes (e.g., adding update timestamps) and is less resource intensive than approaches like running periodic queries to find new or changed records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing real-time data integration&lt;/strong&gt;. Some tasks require that data be pulled from multiple data sets and integrated. For example, a user’s clickstream (page view) events may be joined with details of the products they’re browsing to feed a machine learning model. Performing these joins in the production OLTP databases reduces application responsiveness. CDC allows computationally heavy actions to be run on dedicated subsystems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing aggregations or window analyses&lt;/strong&gt;. An OLTP database may only log events such as commercial transactions. Business analysts may want to see the current sum of sales in a given quarter, however. The events captured through CDC can be aggregated in real-time to update dashboards and other data applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing de-aggregations&lt;/strong&gt;. For performance reasons, an OLTP database may only store the current state of data like counters. For example, a database may store the number of likes or views of social media posts. Machine learning models often need individual events, however. CDC generates an event for every increase or decrease in the counters, effectively creating a historical time series for downstream analyses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementing the CDC Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://debezium.io/"&gt;Debezium&lt;/a&gt; is a popular open-source tool for facilitating CDC. Debezium provides connectors for various open-source and proprietary databases. The connectors listen for data change events and convert the internal representations to common formats such as JSON and Avro. Additionally, some support is provided for filtering and transforming events.&lt;/p&gt;

&lt;p&gt;Debezium was originally designed as connectors that run in the &lt;a href="https://kafka.apache.org/documentation/#connect"&gt;Apache Kafka Connect&lt;/a&gt; framework. Apache Kafka, unfortunately, has a pretty large deployment footprint for production setups. It is &lt;a href="https://docs.confluent.io/platform/current/kafka/deployment.html"&gt;recommended&lt;/a&gt; that a minimal production deployment has at least 3 nodes with 64 GB of RAM and 24 cores with storage configured with RAID 10 and at least 3 additional nodes for a separate Apache Zookeeper cluster. Meaning, a minimal production setup of Apache Kafka requires at least 6 nodes. Further, the JVM and operating system need to be &lt;a href="https://docs.confluent.io/platform/current/kafka/deployment.html#cp-production-parameters"&gt;tuned&lt;/a&gt; significantly to achieve optimal performance.&lt;/p&gt;

&lt;p&gt;Many cloud-native systems are divided into microservices that are designed to scale independently. Rather than relying on one large message broker cluster, it’s common for these systems to deploy multiple, small, independent clusters. &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt; is a next-generation, cloud-native message broker with a low resource footprint, minimal operational overhead, and no required performance tuning.&lt;/p&gt;

&lt;p&gt;Debezium recently announced the general availability of Debezium Server, a framework for using Debezium connectors without Apache Kafka. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; runs in a standalone mode. Sink connectors for a wide range of messaging systems are included out of the box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eaU9sEXM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81vvpffg7cfc4fqiy7u8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eaU9sEXM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81vvpffg7cfc4fqiy7u8.jpg" alt="CDC-PATTERN" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll demonstrate how to implement the CDC pattern for PostgreSQL using Debezium Server and Memphis.dev.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collaborative Power of Open Source: Interfacing Debezium Server and Memphis.dev
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt; and Debezium Server are integrated using REST. &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;The Memphis.dev REST gateway&lt;/a&gt; provides endpoints for consuming messages, while Debezium Server provides the HTTP client sink for transmitting messages via REST. In our reference solution, Debezium Server makes a POST request to /station/todo-cdc-events/produce/single for each message. The REST interface accepts messages in JSON, text, and binary Protocol Buffer formats.&lt;/p&gt;

&lt;p&gt;Unfortunately, we hit a stumbling block while implementing our CDC solution. The Memphis.dev REST gateway uses JSON Web Tokens (JWT) authentication for security, but Debezium Server’s HTTP client didn’t support it. Thanks to the collaborative power of open source, we were able to work with the Debezium developers to &lt;a href="https://github.com/debezium/debezium-server/pull/20"&gt;add JWT authentication functionality&lt;/a&gt;. The user must specify a username, password, and authentication endpoint URL in the Debezium Server configuration file. The server then tracks its authentication state and makes REST requests to perform an initial authorization and refresh that authorization as needed.&lt;/p&gt;
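For orientation, the relevant Debezium Server configuration would contain entries along these lines. This is an illustrative sketch only: the property names should be verified against the Debezium Server HTTP sink documentation, and the host, port, and password values here are assumptions.

```properties
# Illustrative sketch; verify property names against the Debezium Server docs.
debezium.sink.type=http
debezium.sink.http.url=http://localhost:4444/station/todo-cdc-events/produce/single
debezium.sink.http.authentication.type=jwt
debezium.sink.http.authentication.jwt.username=todocdcservice
debezium.sink.http.authentication.jwt.password=todocdcservice
debezium.sink.http.authentication.jwt.url=http://localhost:4444/auth/authenticate
```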

&lt;p&gt;With the JWT authentication now in place, Debezium Server can forward CDC events to Memphis.dev. Further, all Debezium Server users, whether or not they are using Memphis.dev, can benefit from this functionality.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;Here, we describe a reference solution for delivering change data capture events with &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt;. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server’s&lt;/a&gt; HTTP client sink is used to send the CDC events from a &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; database to a Memphis.dev instance using the &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis REST gateway&lt;/a&gt;. Our solution has six components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the PostgreSQL table every 0.5 seconds. Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Configured with a single database containing a single table (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with PostgreSQL source and HTTP Client sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VncL4SOz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7cce4rfrwzsoo7402gx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VncL4SOz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p7cce4rfrwzsoo7402gx.jpg" alt="Postgress CDC example" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Implementation
&lt;/h2&gt;

&lt;p&gt;Code repository: &lt;a href="https://github.com/memphisdev/memphis-example-solutions"&gt;Memphis Example Solutions&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://docs.docker.com/compose/"&gt;Docker Compose&lt;/a&gt; will be needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Build the Docker images for &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt;, the printing consumer, and database setup (table and user creation).&lt;/p&gt;

&lt;p&gt;Currently, our implementation depends on a pre-release version of Debezium Server for the JWT authentication support. A Docker image will be built directly from the main branch of the Debezium and Debezium Server repositories. Note that this step can take quite a while (~20 minutes) to run. When Debezium Server 2.3.0 is released, we will switch to using the upstream Docker image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose build --pull --no-cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Building 0.0s (0/1)
[+] Building 0.2s (2/3)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 19.0s (5/10)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 19.2s (5/10)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.3s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
[+] Building 31.4s (5/10)
[+] Building 32.2s (5/10)
[+] Building 34.2s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
[+] Building 37.6s (11/11) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
[+] Building 37.7s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 300B                             0.0s
[+] Building 37.9s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
[+] Building 38.0s (5/10)
[+] Building 38.2s (5/10)
[+] Building 18.9s (4/14)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 613B                             0.0s
[+] Building 20.0s (4/14)
[+] Building 65.8s (11/11) FINISHED
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 1207.0s (15/15) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 613B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.2s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
 =&amp;gt; [ 2/13] RUN apt update &amp;amp;&amp;amp; apt upgrade -y &amp;amp;&amp;amp; apt install -y openjdk-11-jdk-headless wget git curl &amp;amp;&amp;amp; rm -rf /var/cache/apt/ 49.5s
 =&amp;gt; [ 3/13] RUN git clone https://github.com/debezium/debezium   6.0s
 =&amp;gt; [ 4/13] WORKDIR /debezium                                    0.1s
 =&amp;gt; [ 5/13] RUN ./mvnw clean install -DskipITs -DskipTests     761.4s
 =&amp;gt; [ 6/13] RUN git clone https://github.com/debezium/debezium-server debezium-server-build                                            1.1s
 =&amp;gt; [ 7/13] WORKDIR /debezium-server-build                       0.0s
 =&amp;gt; [ 8/13] RUN ./mvnw package -DskipITs -DskipTests -Passembly  372.1s
 =&amp;gt; [ 9/13] RUN tar -xzvf debezium-server-dist/target/debezium-server-dist-*.tar.gz -C /   2.0s
 =&amp;gt; [10/13] WORKDIR /debezium-server                             0.0s
 =&amp;gt; [11/13] RUN mkdir data                                       0.5s
 =&amp;gt; exporting to image                                          14.0s
 =&amp;gt; =&amp;gt; exporting layers                                          14.0s
 =&amp;gt; =&amp;gt; writing image sha256:51d987a3bf905f35be87ce649099e76c13277d75c4ac26972868fc9af2617d14                                                                0.0s
 =&amp;gt; =&amp;gt; naming to docker.io/library/debezium-server               0.0s
[+] Building 41.8s (11/11) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.3s
 =&amp;gt; [internal] load build context                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 39B                                 0.0s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
 =&amp;gt; [2/6] RUN apt update &amp;amp;&amp;amp; apt upgrade -y &amp;amp;&amp;amp; apt install -y python3 python3-pip &amp;amp;&amp;amp; rm -rf /var/cache/apt/*                          33.5s
 =&amp;gt; [3/6] WORKDIR /app                                           0.0s
 =&amp;gt; [4/6] COPY todo_generator.py /app/                           0.0s
 =&amp;gt; [5/6] RUN pip3 install -U pip wheel                          2.0s
 =&amp;gt; [6/6] RUN pip3 install psycopg2-binary                       1.1s
 =&amp;gt; exporting to image                                           4.9s
 =&amp;gt; =&amp;gt; exporting layers                                          4.9s
 =&amp;gt; =&amp;gt; writing image sha256:6424a08a9dedb77b798610a0b87c1c0a0c5f910039d03d673b3cf47ac54c10de                                                                0.0s
 =&amp;gt; =&amp;gt; naming to docker.io/library/todo-generator                0.0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Start the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt; and &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;REST gateway&lt;/a&gt;. Note that the memphis-rest-gateway service depends on the memphis broker service, so the broker service will be started as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d memphis-rest-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 4/4
 ⠿ Network postgres-debezium-cdc-example_default                   Created                                                       0.1s
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1      Healthy                                                       6.1s
 ⠿ Container postgres-debezium-cdc-example-memphis-1               Health...                                                    16.9s
 ⠿ Container postgres-debezium-cdc-example-memphis-rest-gateway-1  Started                                                      17.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Follow the instructions for &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/postgres-debezium-cdc-example/docs/setup_memphis.md"&gt;configuring Memphis.dev&lt;/a&gt; with a new station (todo-cdc-events) and user (todocdcservice) using the web UI.&lt;/p&gt;

&lt;p&gt;Point your browser at  &lt;a href="http://localhost:9000/"&gt;http://localhost:9000/&lt;/a&gt;. Click the “sign in with root” link at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--efUtHls---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9efnf5stbq3hw75zdeim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--efUtHls---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9efnf5stbq3hw75zdeim.png" alt="sign in" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with root (username) and memphis (password).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f10n_N6i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bisu6jm77a6mymfme2rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f10n_N6i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bisu6jm77a6mymfme2rm.png" alt="memphis root" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the wizard to create a station named todo-cdc-events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gn7ycHsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yms6miyq0botr0oivlim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gn7ycHsi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yms6miyq0botr0oivlim.png" alt="memphis ui" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a user named todocdcservice, using the same value for the password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XDvk52Le--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpeg8v5cxx0ihdv6tn1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XDvk52Le--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpeg8v5cxx0ihdv6tn1s.png" alt="userapp" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “next” until the wizard is finished:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0hSr4B81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dm0mqey1sr0l1lezdz41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0hSr4B81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dm0mqey1sr0l1lezdz41.png" alt="Image description" width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “Go to station overview” to go to the station overview page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W750SqkD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rnb9mcai1x7s491j514.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W750SqkD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rnb9mcai1x7s491j514.png" alt="Image description" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Start the printing consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1  H...                                                              0.6s
 ⠿ Container postgres-debezium-cdc-example-memphis-1           Healthy                                                           1.1s
 ⠿ Container printing-consumer                                 Started           
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the todo item generator, PostgreSQL database, and Debezium Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d todo-generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 7/7
 ⠿ Container postgres                                              Healthy                                                       7.9s
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1      Healthy                                                       0.7s
 ⠿ Container postgres-debezium-cdc-example-memphis-1               Health...                                                     1.2s
 ⠿ Container postgres-debezium-cdc-example-memphis-rest-gateway-1  Running                                                       0.0s
 ⠿ Container database-setup                                        Exited                                                        6.8s
 ⠿ Container debezium-server                                       Healthy                                                      12.7s
 ⠿ Container todo-generator                                        Started      
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the todo item generator depends on the other services and will start them automatically. The database-setup container will run once to create the database, tables, and role in PostgreSQL.&lt;/p&gt;

&lt;p&gt;Lastly, confirm the system is working. Check the todo-cdc-events station overview screen in the Memphis.dev web UI to confirm that the producer and consumer are connected and messages are being delivered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yd3GT7WV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m7wyecj87npv5a1zig43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yd3GT7WV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m7wyecj87npv5a1zig43.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;br&gt;
Then print the logs for the printing-consumer container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker logs --tail 2 printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message:  bytearray(b'{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"fiel
d":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.public.todo_items.Envelope","version":1},"payload":{"before":null,"after":{"item_id":205,"description":"ERJGCHXXOBBGSMOUQSMB","creation_date":1682991115063809,"due_date":null,"completed":false},"source":{"version":"2.3.0-SNAPSHOT","connector":"postgresql","name":"tutorial","ts_ms":1682991115065,"snapshot":"false","db":"todo_application","sequence":"[\\"26715784\\",\\"26715784\\"]","schema":"public","table":"todo_items","txId":945,"lsn":26715784,"xmin":null},"op":"c","ts_ms":1682991115377,"transaction":null}}')
message:  bytearray(b'{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"str
ing","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.public.todo_items.Envelope","version":1},"payload":{"before":null,"after":{"item_id":206,"description":"KXWQYXRWCGSKTBJOJFSX","creation_date":1682991115566896,"due_date":1683250315566896,"completed":false},"source":{"version":"2.3.0-SNAPSHOT","connector":"postgresql","name":"tutorial","ts_ms":1682991115568,"snapshot":"false","db":"todo_application","sequence":"[\\"26715992\\",\\"26715992\\"]","schema":"public","table":"todo_items","txId":946,"lsn":26715992,"xmin":null},"op":"c","ts_ms":1682991115885,"transaction":null}}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
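&lt;p&gt;Each logged message is a Debezium change-event envelope: &lt;code&gt;schema&lt;/code&gt; describes the field types, &lt;code&gt;payload.before&lt;/code&gt; and &lt;code&gt;payload.after&lt;/code&gt; hold the row images, and &lt;code&gt;payload.op&lt;/code&gt; encodes the operation ("c" for create). A minimal Python sketch of unpacking such a message in a consumer (the function is illustrative, not part of the Memphis SDK):&lt;/p&gt;

```python
import json

# Map Debezium's single-letter op codes to readable names.
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}

def summarize_change_event(raw: bytes) -> str:
    """Return a one-line summary of a Debezium change event.

    `raw` is the message body as delivered to the consumer
    (a bytes/bytearray containing the JSON envelope shown above).
    """
    event = json.loads(raw)
    payload = event["payload"]
    op = OP_NAMES.get(payload["op"], payload["op"])
    # For deletes, "after" is null and the old row image is in "before".
    row = payload["after"] if payload["after"] is not None else payload["before"]
    table = payload["source"]["table"]
    return f"{op} on {table}: item_id={row['item_id']} completed={row['completed']}"
```

&lt;p&gt;Applied to the messages above, this would report creates on the todo_items table for the newly generated items.&lt;/p&gt;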






&lt;p&gt;Congratulations! You now have a working example of how to capture and transfer data change events from a PostgreSQL database into Memphis.dev using Debezium Server.&lt;/p&gt;




&lt;p&gt;Check out &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, Developer Advocate at &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>datastreaming</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>Memphis is now GA!</title>
      <dc:creator>avital trifsik</dc:creator>
      <pubDate>Wed, 05 Apr 2023 05:49:20 +0000</pubDate>
      <link>https://forem.com/memphis_dev/memphis-is-now-ga-20ja</link>
      <guid>https://forem.com/memphis_dev/memphis-is-now-ga-20ja</guid>
      <description>&lt;p&gt;Memphis is now GA,&lt;br&gt;
and we do not take this title for granted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's start from the beginning.
&lt;/h2&gt;

&lt;p&gt;Struggling with the engineering overhead of legacy brokers and queues planted the idea that the space needed to be disrupted.&lt;br&gt;
Digging deeper, we realized an obvious fact that is nonetheless easy to forget: messaging queues and brokers are a means to an end, not the goal itself. That understanding opened up a whole new variety of solutions.&lt;/p&gt;

&lt;p&gt;The solution we chose was the most challenging one but, we believe, also the right one: a) it has to be open source; b) it can’t be just an intelligent message broker on steroids; c) it has to offer what we call “Day 2” operations on top, to help build queue-based applications in minutes. These range from the more common use cases, such as async communication between microservices and task scheduling, to event-driven applications, event sourcing, data ingestion, system integration, log collection, and forming a data lake.&lt;br&gt;
With that understanding in mind, we formed the vision of Memphis: an intelligent and frictionless message broker that enables ultra-fast development of queue-based applications for developers and data engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  From vision to GA.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Memphis beta version released on May 15th, 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We focused on the foundations of the ecosystem: integrating with NATS internals, designing Memphis to run natively on Kubernetes and cloud-native environments, and making everything work out of the box, from monitoring and the dead-letter station to schema validation, real-time observability, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With each release, the bug cycles became shorter, and we, as a team and a product, grew smarter by carefully listening to and understanding our users. In doing so, Memphis reached a solid and stable GA that is, no less importantly, suitable for most developers and not just those who share our original challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On April 2nd, we released the GA of the 1st part of Memphis.&lt;br&gt;
Memphis GA stands for a solid, stable, and secure foundation for what is to come: zero known bugs and ready for production.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Some insights from the last eight months.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The average time from installation to data ingestion is 5 minutes.&lt;/li&gt;
&lt;li&gt;We grew from 0 to over 5000 deployments.&lt;/li&gt;
&lt;li&gt;50 new contributors.&lt;/li&gt;
&lt;li&gt;Users report production usage before the GA release.&lt;/li&gt;
&lt;li&gt;Use cases range from async communication between microservices, event-driven applications, event sourcing, data ingestion, system integration, log collection, and security events to forming a data lake.&lt;/li&gt;
&lt;li&gt;Schemaverse has been a game changer to many of our users.&lt;/li&gt;
&lt;li&gt;The most used SDK is Go.&lt;/li&gt;
&lt;li&gt;Cost and simplicity have been major factors in replacing existing tech with Memphis.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The future to come.
&lt;/h2&gt;

&lt;p&gt;I mentioned 1st part, so there is a 2nd part.&lt;br&gt;
Memphis’ 1st part is the storage layer, the message broker with all its benefits as we know it today, and will continue to evolve dramatically over the coming releases. We will also push hard on GitOps, automation enablement, and reconstructing some of the APIs so they can be modular and open for the community to self-implement new ones. Last but not least – multi-tenancy, partitions, read-replicas, and more.&lt;/p&gt;

&lt;p&gt;Memphis’ 2nd part is all about helping developers and data engineers build valuable use cases and queue-based applications on top of Memphis. More on that in the coming weeks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cysBWNmF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nwtlsmv41fvfb3iv4to.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cysBWNmF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nwtlsmv41fvfb3iv4to.jpg" alt="v.1.0.0 is out" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Memphis &lt;a href="https://docs.memphis.dev/memphis/memphis-cloud/signup"&gt;cloud&lt;/a&gt; is right around the corner, but if you prefer to self-host Memphis now - head &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev by Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>messagequeue</category>
      <category>datastreaming</category>
      <category>dataprocessing</category>
      <category>memphisdev</category>
    </item>
  </channel>
</rss>
