<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Team RudderStack</title>
    <description>The latest articles on Forem by Team RudderStack (@teamrudderstack).</description>
    <link>https://forem.com/teamrudderstack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F327308%2F331da2e0-319d-4e00-8077-d4c39c400318.png</url>
      <title>Forem: Team RudderStack</title>
      <link>https://forem.com/teamrudderstack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/teamrudderstack"/>
    <language>en</language>
    <item>
      <title>When To Build vs. Buy Data Pipelines</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 11 Apr 2022 18:58:04 +0000</pubDate>
      <link>https://forem.com/teamrudderstack/when-to-build-vs-buy-data-pipelines-46fo</link>
      <guid>https://forem.com/teamrudderstack/when-to-build-vs-buy-data-pipelines-46fo</guid>
      <description>&lt;p&gt;Deciding whether to build or buy a new software is a challenge every engineer has to deal with. In the world of data engineering, building data pipelines in house was a pretty common choice because it only required a few scripts to pipe your data into your data warehouse or data lake. But this is changing rapidly.&lt;/p&gt;

&lt;p&gt;As data engineers, we now have to handle dozens of constantly changing data sources, and with the rise of real-time use cases, latency matters more than ever. There are many approaches we can take in this new world to develop data infrastructure. If we choose to build our own data pipelines, we end up with data integration systems hand-crafted by multiple engineers over a long period of time, each adding their own special spin to the code base. In the end, most of these data pipeline systems end up looking very similar to a framework that already exists, like &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is because, at the end of the day, most pipeline systems require several key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A scheduler&lt;/li&gt;
&lt;li&gt;  A meta-database&lt;/li&gt;
&lt;li&gt;  Tasks&lt;/li&gt;
&lt;li&gt;  Jobs&lt;/li&gt;
&lt;li&gt;  A web UI&lt;/li&gt;
&lt;/ul&gt;
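&lt;p&gt;For illustration, here is a toy sketch of how those components might fit together. Everything here is hypothetical and drastically simplified; real frameworks like Airflow add DAG dependencies, retries, distributed executors, and a web UI on top.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

# Toy pipeline framework: a scheduler, a meta-database, and tasks/jobs.
# All names are hypothetical and for illustration only.

@dataclass
class Job:
    name: str
    interval_s: float                  # how often the scheduler should run it
    task: Callable                     # the unit of work
    last_run: float = float("-inf")    # never run yet

class MiniScheduler:
    def __init__(self):
        self.jobs = []
        self.meta_db = []              # run history; real systems use Postgres/MySQL

    def register(self, job):
        self.jobs.append(job)

    def tick(self, now):
        """One scheduler pass: run every job whose interval has elapsed."""
        for job in self.jobs:
            if now - job.last_run >= job.interval_s:
                status = "success"
                try:
                    job.task()
                except Exception:
                    status = "failed"
                job.last_run = now
                self.meta_db.append({"job": job.name, "ts": now, "status": status})

results = []
sched = MiniScheduler()
sched.register(Job("load_orders", interval_s=60, task=lambda: results.append("orders")))
sched.tick(now=0)     # due immediately (never run before)
sched.tick(now=30)    # not due yet
sched.tick(now=60)    # due again
```

&lt;p&gt;Even this toy version already needs state (the meta-database) and failure handling, which hints at why hand-rolled pipelines converge toward existing frameworks.&lt;/p&gt;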

&lt;p&gt;As engineers, we do have a tendency to approach most of our problems as build vs. buy. However, we don't always weigh the opportunity costs, and sometimes building is not the best option. The right answer depends on overall company goals and where your company is in its analytical maturity cycle. In this article, we will discuss the build vs. buy decision for event streaming and ETL/ELT pipelines to help your team make the right choice for its next &lt;a href="https://www.rudderstack.com/blog/building-a-reliable-customer-data-infrastructure"&gt;data infrastructure&lt;/a&gt; component.&lt;/p&gt;

&lt;h3&gt;
  
  
  The challenges of building and maintaining data pipelines
&lt;/h3&gt;

&lt;p&gt;Building data infrastructure is a long process, and maintaining it is time-consuming. Even small requests can become arduous to take on. This is amplified if your company works with dozens of data sources, requiring you to maintain all the connectors as the underlying APIs and sources change.&lt;/p&gt;

&lt;p&gt;In addition, we are often bombarded with ad hoc requests from other teams while maintaining the current code base. You know the feeling: it's like death by 1,000 cuts. It keeps your &lt;a href="https://en.wikipedia.org/wiki/Time_management#The_Eisenhower_Method"&gt;urgent and important quadrant&lt;/a&gt; full of redundant and uninspiring work, and it keeps you from other, more strategically important priorities.&lt;/p&gt;

&lt;p&gt;The key takeaway here is that constant maintenance and ad hoc requests can significantly delay real business impact and introduce scalability challenges, so buying solutions or using managed services can be a good choice for many teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of buying
&lt;/h3&gt;

&lt;p&gt;There are always trade-offs between build and buy. Let's start by talking about some benefits of buying solutions.&lt;/p&gt;

&lt;p&gt;Quick turnaround - Bought solutions often meet the majority of a company's use cases quickly. After the sales cycle, the only time required is implementation, so your team can start using new tooling as soon as it's purchased. Often, you'll have a head start because you've already tested the tool via a free offering or trial.&lt;/p&gt;

&lt;p&gt;Less maintenance - Maintenance cost is an open secret. All solutions, built or bought, have maintenance costs; the difference is who pays them. When you buy a solution, the solution provider shoulders all maintenance and any technical debt, distributing these costs over their whole customer base. This frees your team to spend time on work that adds value rather than running the hamster wheel of maintenance.&lt;/p&gt;

&lt;p&gt;Don't need to keep up with APIs (in the case of connectors) - Keeping up with connector changes is a big (and really annoying) time suck for a data engineer. This is related to maintenance, but rebuilding connectors is such a staple of many data engineers' work that it deserves its own point. Many tools provide connectors out of the box, shifting the burden of keeping up with connector changes from the company to the solution provider.&lt;/p&gt;

&lt;p&gt;New features don't need to be built by you - Buying a solution removes the need for your company to keep improving the tool. Instead, all optimization and new feature development is in the hands of the solution provider. In this modern era, where there are a dozen solutions for just about every problem, competition ensures providers are constantly motivated to develop a better product. So, when you buy, you won't need to find the funds and time to develop, maintain, and improve a custom-built solution. Instead, you can push the solution provider to constantly improve their offering.&lt;/p&gt;

&lt;h3&gt;
  
  
  The challenges that come with buying
&lt;/h3&gt;

&lt;p&gt;Of course, buying is far from a perfect solution. For every benefit you get from buying a great solution, there will be trade-offs. Here are a few:&lt;/p&gt;

&lt;p&gt;Less flexibility - Most bought tools limit how much functionality you can edit or modify. This means that if your company has very specific use cases or requirements the solution doesn't cover, you will need some form of workaround.&lt;/p&gt;

&lt;p&gt;Less control - Let's say the solution you purchase has all of the functionality you require now. If you later have new use cases or want to make small edits that may just be personal preferences, these may not be possible. You can file tickets with the vendor, but since you didn't build the solution, it might be a while until they are addressed. As referenced above, not being responsible for building new features is great when you don't have a development team or the time, but if you have the team, the time, and the need, building gives your team control.&lt;/p&gt;

&lt;p&gt;Vendor lock-in - Whenever you pick a tool, built or bought, there is an inevitable amount of lock-in. With a bought solution this can be even more deterring: you are paying a monthly invoice to the vendor, and you may have a multi-year agreement keeping you tied to them for a longer period of time.&lt;/p&gt;

&lt;p&gt;Lots of different tools lead to multiple learning curves - Every tool has an unavoidable learning curve, no matter how easy or low-code. Here's another way to look at it: a broad range of users knows Python; it's a general skill most programmers understand. Picking a more niche tool with a smaller user base, even if it's lower code, involves a learning curve, and hiring people who don't know the tool you work with makes for slow initial development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Buy or build considerations
&lt;/h2&gt;

&lt;p&gt;Talking through pros and cons doesn't make it that easy to check off whether to build or buy, so how should you think about these decisions? Here are a few points to consider when trying to answer the build vs. buy question for technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  When To Buy
&lt;/h3&gt;

&lt;p&gt;✅ Your team's main focus is not building software, and it doesn't have a track record of delivering large-scale solutions.&lt;/p&gt;

&lt;p&gt;✅ Your team has budget limitations and there are tools that can meet said budget.&lt;/p&gt;

&lt;p&gt;✅ You have a tight timeline and need to turn around value quickly.&lt;/p&gt;

&lt;p&gt;✅ Your team has limited resources or technical knowledge for the specific solution they would need to build. For example, if you need to build a model to detect fraud, but no one on your team has done it before, it might be time to look for a solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  When To Build
&lt;/h3&gt;

&lt;p&gt;✅ Your executive team needs a unique function or ability that no solutions currently offer.&lt;/p&gt;

&lt;p&gt;✅ You have a bigger scope and vision for the solution and plan to sell it externally.&lt;/p&gt;

&lt;p&gt;✅ You don't have a tight timeline (Yeah right).&lt;/p&gt;

&lt;p&gt;✅ Your team is proficient in delivering large-scale projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  So Which Is Right For You? Build Or Buy?
&lt;/h3&gt;

&lt;p&gt;In a fast-moving world where engineering talent costs are increasing, it's important to balance build vs. buy. Yes, there are pros and cons to both building and buying. However, in the age of the modern data stack, the cloud is supplying us with a host of pre-built tools we can test out with a free trial to help us answer this question.&lt;/p&gt;

&lt;p&gt;Truthfully, most companies are too busy with other operational needs to fully commit to internally building tools. In my experience in the data management world, even when tools get built, once the original developer leaves, the tool starts to degrade over time. Not to mention, with average salaries rising at all kinds of tech companies, not just the MANGAs, it's hard to keep talent for long. Thus, if your company doesn't sell software or your team doesn't have the time, it's worth signing up for that free trial to test drive a "buy" solution, because the total cost of building is probably far higher than that of a bought solution. More and more, I'm talking to data engineering teams and solutions architects across the data sphere who are able to do more with less by buying the right data pipeline solutions, especially as modern SaaS tools offer increased flexibility for developers.&lt;/p&gt;

</description>
      <category>cdp</category>
      <category>datascience</category>
      <category>datapipeline</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How RudderStack Core Enabled Us To Build Reverse ETL</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Tue, 05 Apr 2022 21:05:44 +0000</pubDate>
      <link>https://forem.com/rudderstack/how-rudderstack-core-enabled-us-to-build-reverse-etl-47j</link>
      <guid>https://forem.com/rudderstack/how-rudderstack-core-enabled-us-to-build-reverse-etl-47j</guid>
      <description>&lt;p&gt;One of the goals of a &lt;a href="https://www.rudderstack.com/blog/what-is-a-customer-data-platform/"&gt;customer data platform&lt;/a&gt; is to make the movement of data from any source to any destination easy while guaranteeing correctness, reliability, efficiency, and observability. In that sense, reverse ETL is no different, it's &lt;a href="https://www.rudderstack.com/blog/reverse-etl-is-just-another-data-pipeline"&gt;just another data pipeline&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In 2019, RudderStack started as a data infrastructure tool supporting &lt;a href="https://www.rudderstack.com/product/event-stream"&gt;event streaming&lt;/a&gt; to multiple destinations, including the data warehouse. From the outset, we made the data warehouse (or data lake/lakehouse 🙂) a first-class citizen, supplying automated pipelines that allow companies to centralize all of their customer data in the warehouse. It's important not to overlook the impact of this decision, because placing the storage layer at the center and making all the data accessible is key to unlocking a plethora of use cases. But getting the data into the warehouse is basically only useful for analytics. It's getting it back out that enables brand-new use cases, and that's where Reverse ETL comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Reverse ETL?
&lt;/h2&gt;

&lt;p&gt;Reverse ETL is a new category that automates brand new business use cases on top of warehouse data by routing that data to cloud SaaS solutions or operational systems, where sales, marketing, and customer success teams can activate it.&lt;/p&gt;

&lt;p&gt;Building pipelines for Reverse ETL comes with a unique set of technical challenges, and that is what this blog is about. I'll detail our engineering journey, how we built &lt;a href="https://www.rudderstack.com/product/reverse-etl"&gt;RudderStack Reverse ETL&lt;/a&gt;, and how RudderStack Core helped us solve more than half of the challenges we faced. In a way, building this felt like a natural progression for us to bring the modern data stack full circle.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RudderStack Core?
&lt;/h2&gt;

&lt;p&gt;RudderStack Core is the engine that ingests, processes, and delivers data to downstream destinations. Its main features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Ingest events at scale&lt;/li&gt;
&lt;li&gt;  Handle back pressure when destinations are not reachable&lt;/li&gt;
&lt;li&gt;  Run user-defined JavaScript functions to modify events on the fly&lt;/li&gt;
&lt;li&gt;  Generate reports on deliveries and failures&lt;/li&gt;
&lt;li&gt;  Guarantee that events are delivered in the same order in which they are ingested&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8bssQyS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/3a0bbe226d5e8cae05a93a94ff3586f1c92568c4-1999x1227.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8bssQyS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/3a0bbe226d5e8cae05a93a94ff3586f1c92568c4-1999x1227.png%3Fauto%3Dformat" alt="Rudderstack architecture diagram" width="880" height="540"&gt;&lt;/a&gt;\&lt;/p&gt;

&lt;h2&gt;
  
  
  The technical challenges we faced building Reverse ETL
&lt;/h2&gt;

&lt;p&gt;First, I'll give a bird's-eye view of the different stages of building Reverse ETL and the challenges associated with them. Along the way, I'll explain how RudderStack Core helped us launch incrementally, making several big hurdles a piece of cake. I must give major kudos to our founding engineers who built this core in a "think big" way. Their foresight drastically reduced the effort we had to put into designing and building engineering solutions for Reverse ETL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nFyTarKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/a579521b6562e937eba42e1fa2cedc06cf01724f-2000x797.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nFyTarKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/a579521b6562e937eba42e1fa2cedc06cf01724f-2000x797.png%3Fauto%3Dformat" alt="Reverse ETL flow diagram" width="880" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Creating a Reverse ETL pipeline
&lt;/h3&gt;

&lt;p&gt;Out of all the steps, this was the easiest one, though it was still a bit tricky.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.1 Creating a source
&lt;/h4&gt;

&lt;p&gt;Warehouse source creation gets complicated because of credentials and because of the read and write permissions needed to maintain transient tables for snapshots and evaluating diffs. It's important to ensure the user can easily grant only the necessary permissions for reverse ETL, so the pipeline tool does not end up with access to more tables in the customer's production environment than needed, or with any unnecessary write access.&lt;/p&gt;

&lt;p&gt;This is a tricky problem made harder by the differences between warehouses. We asked ourselves a few key questions when building this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How can we simplify and streamline the commands and accesses for different warehouses?&lt;/li&gt;
&lt;li&gt;  How can we help users validate these credentials when creating a source?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this instance, our control plane enabled us to reuse and build on existing components. This was crucial because we wanted to make validations in a generic way, so they would be reusable as we continue adding more data warehouse and data lake sources. Our team iterated a lot on how to educate users on which permissions are required and why. Check out our documentation on &lt;a href="https://www.rudderstack.com/docs/reverse-etl/snowflake/#step-1-creating-a-new-role-and-user-in-snowflake"&gt;creating a new role and user in Snowflake&lt;/a&gt; for an example. We had to work to ensure only relevant validations and errors would show when setting up a source, and we came up with faster ways to run some validations.&lt;/p&gt;

&lt;p&gt;As an example, in our first iteration we used Snowflake queries to verify that the provided credentials gave us the access RudderStack needs on the target schema, so we could read, write, and manage transient tables in it. These queries were scheduled in the normal queue manner by Snowflake, but for some customers they took minutes to run. So, we found a &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/show.html#general-usage-notes"&gt;better&lt;/a&gt; solution from Snowflake: SHOW commands do not require a running warehouse to execute. With this new solution, validations complete within a minute or less for all customers. As we built out the reverse ETL source creation flow, the big wins that we adopted from the existing RudderStack Core platform were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Our WebApp React components' modular designs were re-usable in the UI&lt;/li&gt;
&lt;li&gt;  We were able to re-use code for managing credentials securely and propagate it to the Reverse ETL system in the data plane&lt;/li&gt;
&lt;li&gt;  We were able to deliver faster because RudderStack Core allowed us to focus on the user experience and features vs. building infrastructure from the ground up&lt;/li&gt;
&lt;/ul&gt;
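&lt;p&gt;To make the least-privilege idea concrete, here is a hypothetical sketch of a grant check. The required privilege set and the grant tuples are invented for illustration; in practice the grants would come from warehouse commands such as Snowflake's SHOW GRANTS, and each warehouse has its own privilege model.&lt;/p&gt;

```python
# Hypothetical privilege check for a reverse ETL source credential.
# REQUIRED and the grant format below are made up for illustration.

REQUIRED = {("USAGE", "DATABASE"), ("USAGE", "SCHEMA"),
            ("SELECT", "TABLE"), ("CREATE TABLE", "SCHEMA")}

def validate_grants(grants):
    """Return (missing, excessive) privilege sets for the supplied role."""
    granted = set(grants)
    missing = REQUIRED - granted
    # Write access to customer tables is more than reverse ETL needs.
    excessive = {g for g in granted
                 if g[0] in ("INSERT", "UPDATE", "DELETE") and g[1] == "TABLE"}
    return missing, excessive

missing, excessive = validate_grants([
    ("USAGE", "DATABASE"), ("USAGE", "SCHEMA"),
    ("SELECT", "TABLE"), ("CREATE TABLE", "SCHEMA"),
    ("DELETE", "TABLE"),     # should be flagged as unnecessary write access
])
```

&lt;p&gt;A check along these lines lets the UI surface exactly which permissions are missing, and warn when a role is over-privileged, before a source is saved.&lt;/p&gt;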

&lt;h4&gt;
  
  
  1.2 Creating a destination
&lt;/h4&gt;

&lt;p&gt;Every data pipeline needs a source and a destination. When it came to creating destinations for Reverse ETL, RudderStack Core really shined. Enabling existing destination integrations from our Event Stream pipelines was straightforward. We built a simple JSON Mapper for translating table rows into payloads and were able to launch our Reverse ETL pipeline with over 100 destinations out of the box. Today the count is over 150 and growing! We're also incrementally adding these destinations to our &lt;a href="https://www.rudderstack.com/docs/reverse-etl/features/visual-data-mapper/"&gt;Visual Data Mapper&lt;/a&gt;. For further reading, &lt;a href="https://www.rudderstack.com/blog/using-rudderstack-to-backfill-data-in-mixpanel/"&gt;here's a blog&lt;/a&gt; on how we backfilled data into an analytics tool with Reverse ETL and some User Transformations magic.&lt;/p&gt;
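&lt;p&gt;Conceptually, a JSON mapper translates one warehouse row into one event payload using a user-defined column mapping. Here is a toy sketch; the field names, the mapping shape, and the payload format are illustrative rather than RudderStack's actual spec.&lt;/p&gt;

```python
# Toy JSON mapper: map warehouse columns onto event traits.
# All names here are illustrative.

def map_row(row, mapping, user_id_column):
    """Translate a row dict into an identify-style event payload."""
    traits = {dest: row[src] for src, dest in mapping.items() if src in row}
    return {"type": "identify", "userId": row[user_id_column], "traits": traits}

row = {"id": "u_42", "email_addr": "ada@example.com", "plan_name": "pro"}
mapping = {"email_addr": "email", "plan_name": "plan"}
payload = map_row(row, mapping, user_id_column="id")
```

&lt;p&gt;Because the output is an ordinary event payload, every downstream destination integration that already understands events can consume it unchanged, which is why existing Event Stream destinations worked out of the box.&lt;/p&gt;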

&lt;h3&gt;
  
  
  2. Managing orchestration
&lt;/h3&gt;

&lt;p&gt;The Orchestrator was critical and one of the more challenging systems to build, especially at the scale RudderStack is running. Reverse ETL works like any batch framework, similar to ETL. If you're familiar with tools like &lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt;, &lt;a href="https://www.prefect.io/"&gt;Prefect&lt;/a&gt;, &lt;a href="https://dagster.io/"&gt;Dagster&lt;/a&gt;, or &lt;a href="https://temporal.io/"&gt;Temporal&lt;/a&gt;, you know what I'm talking about: the ability to schedule complex jobs across different servers or nodes using DAGs as a foundation.&lt;/p&gt;

&lt;p&gt;Of course, you're probably wondering which framework we used to build out this orchestration layer. We did explore these options, but ultimately decided to build our own orchestrator from scratch for a few key reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We wanted a solution that would be easily deployed along with a &lt;a href="https://github.com/rudderlabs/rudder-server"&gt;rudder-server&lt;/a&gt; instance, in the same sense that rudder-server is easily deployed by open source customers.&lt;/li&gt;
&lt;li&gt;  We wanted an orchestrator that could potentially depend on the same Postgres instance as rudder-server for minimal installation and would be easy to deploy as a standalone service or as separate workers.&lt;/li&gt;
&lt;li&gt;  We love &lt;a href="https://go.dev/"&gt;Go&lt;/a&gt;! And we had fun tackling the challenge of building an orchestrator that suits us. In the long run, this will enable us to modify and iterate based on requirements.&lt;/li&gt;
&lt;li&gt;  Building our own orchestrator makes local development, debugging, and testing much easier than using complex tools like Airflow.&lt;/li&gt;
&lt;li&gt;  We love open source and would like to contribute a simplified version of RudderStack Orchestrator in the future.&lt;/li&gt;
&lt;/ul&gt;
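&lt;p&gt;The claim-and-run pattern at the heart of any orchestrator can be sketched minimally. This illustration stands in an in-memory queue for the Postgres-backed task table, and is a sketch of the general pattern, not RudderStack's actual (Go) orchestrator.&lt;/p&gt;

```python
import queue
import threading

# Toy orchestrator: a shared job queue plus worker threads that claim
# and run sync tasks. In the real system the queue would be a database
# table so workers on different nodes can claim tasks.

jobs = queue.Queue()
done = []
done_lock = threading.Lock()

def worker():
    while True:
        try:
            job = jobs.get_nowait()    # claim the next task, if any
        except queue.Empty:
            return                     # no work left; worker exits
        with done_lock:
            done.append(job)           # a real worker would run the sync here
        jobs.task_done()

for name in ["sync:hubspot", "sync:braze", "sync:mixpanel"]:
    jobs.put(name)

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

&lt;p&gt;Swapping the in-memory queue for a shared database is what makes the pattern survive process restarts and spread work across nodes, which is where the real engineering effort goes.&lt;/p&gt;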

&lt;h3&gt;
  
  
  3. Managing snapshots and diffing
&lt;/h3&gt;

&lt;p&gt;Let's consider one simple mode of syncing data: upsert. This means syncing only updated or newly inserted rows in every scheduled sync. There are two ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Marker column: In this method, you define a marker column like updated_at and use it in a query to find updates/inserts since the previous sync ran. There are multiple issues with this approach. First, you have to educate the user to build that column into every table. Second, it's often difficult to maintain these marker columns in warehouses (for application databases this is natural, and many DBs provide it without any extra developer work).&lt;/li&gt;
&lt;li&gt;  Primary key and diffing: In this method, you define a primary key column and have complex logic for diffing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We went with the second option. One major reason was that we could run the solution on top of the customer's warehouse to avoid introducing another storage component into the system. Also, the compute power and fast query support in modern warehouses were perfect for solving this with queries and maintaining snapshots and diffs to create transient sync tables.&lt;/p&gt;
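&lt;p&gt;Here is a minimal sketch of primary-key diffing. In production this logic runs as SQL inside the customer's warehouse against transient snapshot tables; the row hashing and row shapes below are illustrative.&lt;/p&gt;

```python
import hashlib
import json

# Primary-key diffing sketch: compare the current table against the last
# snapshot by hashing each row, and emit only inserts and updates.

def row_hash(row):
    """Stable hash of a row's contents."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def diff(snapshot, current, pk):
    """snapshot/current: lists of row dicts; pk: primary key column name."""
    old = {r[pk]: row_hash(r) for r in snapshot}
    inserts, updates = [], []
    for r in current:
        if r[pk] not in old:
            inserts.append(r)                 # brand new primary key
        elif old[r[pk]] != row_hash(r):
            updates.append(r)                 # key existed, contents changed
    return inserts, updates

snapshot = [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}]
current  = [{"id": 1, "plan": "team"}, {"id": 2, "plan": "pro"},
            {"id": 3, "plan": "free"}]
inserts, updates = diff(snapshot, current, pk="id")
```

&lt;p&gt;Doing the equivalent comparison in SQL lets the warehouse's own compute do the heavy lifting, which is exactly the property that made the second option attractive.&lt;/p&gt;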

&lt;p&gt;Hubspot table after incremental sync of new rows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ifGBBQkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/98c2f653ad0a5566ae234f795a26e2546e4ffc61-1600x317.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ifGBBQkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/98c2f653ad0a5566ae234f795a26e2546e4ffc61-1600x317.png%3Fauto%3Dformat" alt="screenshot of Hubspot table" width="880" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sync screen in RudderStack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Mr36Bb5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/707bbba1afcf88a390b1170a7d77cd4d480b48f9-1600x405.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Mr36Bb5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/707bbba1afcf88a390b1170a7d77cd4d480b48f9-1600x405.png%3Fauto%3Dformat" alt="screenshot of the rudderstack UI showing the sync screen" width="880" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Snapshot table view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8bYhOddS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/c93e52e1dfdb634e04a9edee73efca0747aa5904-1585x208.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8bYhOddS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/c93e52e1dfdb634e04a9edee73efca0747aa5904-1585x208.png%3Fauto%3Dformat" alt="screenshot of the snapshot table UI" width="880" height="115"&gt;&lt;/a&gt;\&lt;br&gt;
Now, you might be thinking: "What's the big deal? It's just creating some queries, running them and syncing data?" I wish, but it's not as simple as it looks. Also, this was one of the challenges RudderStack core couldn't help with. Here are a few of the challenges that emerge when you dig deeper into the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Diffing needs to be very extensible, not only for the multiple warehouse sources we already support, but also for integrating with future warehouse and data lake sources.&lt;/li&gt;
&lt;li&gt;  You have to implement state machine based tasks to handle software or system crashes and any errors that occur across a multitude of dependencies.&lt;/li&gt;
&lt;li&gt;  You have to maintain record ordering checkpoints during sync to get as close as possible to exactly-once delivery to destinations.&lt;/li&gt;
&lt;li&gt;  You have to support functionality for pausing and resuming syncs.&lt;/li&gt;
&lt;li&gt;  You have to handle delivery of records that failed to deliver on the previous sync.&lt;/li&gt;
&lt;/ul&gt;
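&lt;p&gt;The state-machine item above can be sketched like this. The state names and the in-memory store are invented for illustration; the real system persists each transition so a crashed run can resume from its last recorded state.&lt;/p&gt;

```python
# Sketch of a state-machine sync task. Each state transition is written
# to a store before proceeding, so a restart resumes from the last state.

TRANSITIONS = {
    "created": "snapshotting",
    "snapshotting": "diffing",
    "diffing": "delivering",
    "delivering": "succeeded",
}

def advance(state_store, task_id):
    """Move one task to its next state, persisting the transition."""
    state = state_store[task_id]
    nxt = TRANSITIONS.get(state)
    if nxt is None:
        raise ValueError(f"no transition from terminal state {state!r}")
    state_store[task_id] = nxt   # in production this write is the checkpoint
    return nxt

store = {"task-1": "created"}
seen = []
while store["task-1"] != "succeeded":
    seen.append(advance(store, "task-1"))
```

&lt;p&gt;Because every step is recorded, a crash mid-diff simply re-enters the loop at "diffing" instead of redoing the snapshot, which is the behavior the list above is asking for.&lt;/p&gt;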

&lt;p&gt;On top of those considerations, there were a number of other interesting problems we found related to memory, the choice of CTEs vs. temporary tables, column data types, structs in BigQuery, and more, but that's another post for another day.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Managing syncing, transformations, and delivery to destinations
&lt;/h3&gt;

&lt;p&gt;RudderStack Core significantly shortened the development cycle for syncing, running transformations in the data pipeline, and final delivery to destinations.&lt;/p&gt;

&lt;p&gt;In large part, this is because our Reverse ETL and Event Stream pipelines have a lot in common relative to these use cases. In fact, from a source perspective, Reverse ETL pulling from warehouse tables is much simpler than SDK sources, so we were able to have more precise control over ingestion and leverage rudder-server for everything else. Here's what rudder-server took care of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Destination transformations (mapping payloads to destination API specs)&lt;/li&gt;
&lt;li&gt;  Calling the right APIs for add, update, delete, and batch APIs if supported&lt;/li&gt;
&lt;li&gt;  User transformations (custom JavaScript code written by users to modify payloads)&lt;/li&gt;
&lt;li&gt;  Managing the rate limits of destination APIs (which vary significantly) and providing a back pressure mechanism for Reverse ETL&lt;/li&gt;
&lt;li&gt;  Handling failed events with retries and providing finally failed events back to Reverse ETL&lt;/li&gt;
&lt;li&gt;  A mechanism to identify completion of sync tasks&lt;/li&gt;
&lt;li&gt;  New integrations and feature enhancements (automatically usable by our Reverse ETL pipeline when deployed to RudderStack Core)&lt;/li&gt;
&lt;/ul&gt;
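&lt;p&gt;The retry handling in that list might look something like this in miniature. The attempt limit and the exponential backoff schedule are made up; rudder-server's actual retry policy is more sophisticated.&lt;/p&gt;

```python
# Sketch of retry-with-backoff for failed deliveries: transient failures
# are retried a bounded number of times, then handed back as finally failed.

def deliver_with_retries(send, event, max_attempts=3):
    """Try to deliver one event; return (status, backoff delays used)."""
    delays = []
    for attempt in range(max_attempts):
        if send(event):
            return "delivered", delays
        delays.append(2 ** attempt)   # exponential backoff schedule (seconds)
    return "finally_failed", delays

calls = {"n": 0}
def flaky_send(event):
    calls["n"] += 1
    return calls["n"] >= 3            # succeeds on the third attempt

status, delays = deliver_with_retries(flaky_send, {"id": 1})
```

&lt;p&gt;Events that exhaust their retries are the "finally failed" records handed back to Reverse ETL for redelivery on the next sync, as described above.&lt;/p&gt;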

&lt;p&gt;Even though the items above were huge wins from RudderStack Core, there were some other interesting problems we had to solve because we use rudder-server as our engine to deliver events. I won't dive into those now, but here's a sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  It's challenging to deliver events to our multi-node rudder-server in a multi-tenant setup&lt;/li&gt;
&lt;li&gt;  It's complicated to guarantee event ordering for destinations that require it&lt;/li&gt;
&lt;li&gt;  We have to respect the rate limits of different destinations and use back pressure mechanisms, so we don't overwhelm rudder-server, all while maintaining fast sync times&lt;/li&gt;
&lt;li&gt;  It's hard to acknowledge completion of a sync run only once all records have been successfully delivered to the destination&lt;/li&gt;
&lt;/ul&gt;
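&lt;p&gt;One common way to respect a destination's rate limit while signaling back pressure upstream is a token bucket. Here is a hedged sketch: the capacity and refill numbers are made up, and the actual mechanism inside rudder-server may differ.&lt;/p&gt;

```python
# Token-bucket sketch: allow a bounded request rate; when no token is
# available, the caller should pause the sync instead of overwhelming
# the delivery layer (back pressure).

class TokenBucket:
    def __init__(self, capacity, refill_per_s):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_s = refill_per_s
        self.last = 0.0

    def allow(self, now):
        """Refill based on elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # no token: caller backs off

bucket = TokenBucket(capacity=2, refill_per_s=1)
decisions = [bucket.allow(now=0), bucket.allow(now=0),
             bucket.allow(now=0), bucket.allow(now=1)]
```

&lt;p&gt;The third call is refused because the bucket is empty; after one simulated second a token has been refilled and delivery can resume, which is the pause-and-resume behavior back pressure requires.&lt;/p&gt;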

&lt;h3&gt;
  
  
  5. Maintaining pipelines with observability, debuggability, and alerting
&lt;/h3&gt;

&lt;p&gt;Any automated data pipeline needs some level of observability, debugging, and alerting, so that data engineers can take action when there are problems and align with business users who are dependent on the data.&lt;/p&gt;

&lt;p&gt;This is particularly challenging with systems like Reverse ETL. Here are the main challenges we had to solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Long running processes must account for software crashes, deployments, upgrades, and resource throttling&lt;/li&gt;
&lt;li&gt;  The system has dependencies on hundreds of destinations, and those destinations have API upgrades, downtime, configuration changes, etc.&lt;/li&gt;
&lt;li&gt;  Because RudderStack doesn't store data, we had to find innovative ways to provide observability, such as live debuggers, in-process counts (sending/succeeded/failed), and clear reasons for any critical errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Accounting for software crashes, deployments, upgrades, and resource throttling required a thoughtful design for Reverse ETL. Here's how we did it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  State machine: State-based systems look simple but are incredibly powerful if designed well. Specifically, if an application crashes, it can resume correctly. Even failure states, like failed snapshots, can be handled properly by, say, ignoring them on the next snapshot run.&lt;/li&gt;
&lt;li&gt;  Granular checkpoint: This helps make sure no duplicate events are sent to destinations. For example, say we send events in batches of 500 and checkpoint after each batch. At worst, one entire batch might get sent again if the system restarted, or a deployment happened, after the batch was sent to rudder-server but before it was checkpointed. On top of this, rudder-server only has to keep a minimal window of data to apply dedupe logic, because it doesn't need to save an identifier for every record in a full sync task.&lt;/li&gt;
&lt;li&gt;  Support for handling shutdown and resuming: Graceful shutdown handling is critical for any application, especially for long running stateful tasks. My colleague Leo &lt;a href="https://www.rudderstack.com/blog/implementing-graceful-shutdown-in-go/"&gt;wrote an amazing blog post about how we designed graceful shutdown in Go&lt;/a&gt;, which you should definitely read.&lt;/li&gt;
&lt;li&gt;  Auto-scaling systems: Automatically scaling systems handle tasks running in a distributed system, which is necessary for handling scale, both on the Reverse ETL side and on the consumer side (rudder-server). At any given time a Reverse ETL task might be running on a single node, but it might have to be picked up by another node if the original node crashes for some reason. On the consumer side (rudder-server), data points might be sent to consumers running on multiple nodes. Minimizing duplicates, tracking in-progress successfully sent records, and acknowledging completion of sync tasks are really interesting problems at scale.&lt;/li&gt;
&lt;li&gt;  Proper metrics and alerts: We added extensive metrics and various alerts, such as the time taken for each task, the number of records processed from extraction to transformation to destination API calls, sync latencies for batches of records, and more.&lt;/li&gt;
&lt;li&gt;  Central reporting on top of metrics: Beyond just metrics for Reverse ETL, there is a need for a central reporting system as multiple systems are involved in running the pipeline, from extraction to final destination. We wanted to capture details for all stages to ensure we had full auditability for every pipeline run.&lt;/li&gt;
&lt;/ul&gt;
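
&lt;p&gt;To make the checkpointing idea concrete, here's a minimal sketch (not our actual implementation, and in Python rather than Go) of a batch-and-checkpoint loop: because the checkpoint is written only after a batch is handed off, a crash can re-send at most one batch.&lt;/p&gt;

```python
from enum import Enum

class SyncState(Enum):
    # Illustrative subset of the states a sync task can move through.
    PENDING = "pending"
    SENDING = "sending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

BATCH_SIZE = 500  # mirrors the batch-of-500 example above

def run_sync(records, checkpoint, send_batch, save_checkpoint):
    """Resume from the last durable checkpoint. At most one batch can be
    re-sent after a crash, because we checkpoint only after a send."""
    position = checkpoint  # index of the first unsent record
    while position < len(records):
        batch = records[position:position + BATCH_SIZE]
        send_batch(batch)          # hand the batch to the consumer
        position += len(batch)
        save_checkpoint(position)  # durable checkpoint after the send
    return SyncState.SUCCEEDED
```

&lt;p&gt;Restarting with the last saved checkpoint instead of 0 picks up exactly where the previous run left off, which is the property that makes crash recovery safe.&lt;/p&gt;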

&lt;p&gt;Again, RudderStack Core was a huge help in shipping several of the above components of the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.rudderstack.com/docs/destinations/"&gt;Destinations&lt;/a&gt;: when it comes to integrations, maintenance is critical because things must be kept up to date. Many times things fail because of destination API upgrades or different rate limits, not to mention upkeep like adding additional support for new API versions, batch APIs, etc. Because destinations are a part of RudderStack Core, the Reverse ETL team doesn't have to maintain any destination functionality.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.rudderstack.com/docs/user-guides/administrators-guide/monitoring-and-metrics/"&gt;Metrics&lt;/a&gt;: rudder-server already included metrics for things like successfully sent counts, failed counts with errors, and more, all of which we were able to use for our Reverse ETL pipelines.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.rudderstack.com/docs/get-started/live-events/"&gt;Live Debugger&lt;/a&gt;: Seeing events flow live is incredibly useful for debugging while sync is running, especially because we don't store data in RudderStack. We were able to use the existing Live Debugger infrastructure for Reverse ETL.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Concluding thoughts
&lt;/h2&gt;

&lt;p&gt;Building out our Reverse ETL product was an amazing experience. While there were many fun challenges to solve, I have to reiterate my appreciation for the foresight of our founding engineers. As you can see, without RudderStack Core this would have been a much more challenging and time consuming project.&lt;/p&gt;

&lt;p&gt;If you made it this far, thanks for reading, and if you love solving problems like the ones I covered here, come join our team! Check out our open positions &lt;a href="https://boards.greenhouse.io/rudderstack"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>cdp</category>
      <category>reverseetl</category>
      <category>customerdata</category>
    </item>
    <item>
      <title>Making Data Engineering Easier: Operational Analytics With Event Streaming and Reverse ETL</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 04 Apr 2022 15:31:54 +0000</pubDate>
      <link>https://forem.com/rudderstack/making-data-engineering-easier-operational-analytics-with-event-streaming-and-reverse-etl-2ccg</link>
      <guid>https://forem.com/rudderstack/making-data-engineering-easier-operational-analytics-with-event-streaming-and-reverse-etl-2ccg</guid>
      <description>&lt;p&gt;The past several years have been somewhat of a revolution in data technologies as tools like the cloud data warehouse have come of age, and principles borrowed from software engineering are rapidly being turned into data-focused SaaS.&lt;/p&gt;

&lt;p&gt;There is a huge amount of information about what these tools allow you to do, and rightly so. In many ways the modern data stack fulfills promises that siloed SaaS tools never could. For example, getting data out of data silos to build a complete, single source of truth customer profile is actually possible in the data warehouse/data lake.&lt;/p&gt;

&lt;p&gt;Tooling has also emerged that helps companies "put data to work" once they have created some sort of value with it (most often in a data warehouse like Snowflake, Google BigQuery, Amazon Redshift, etc.). This has been called &lt;em&gt;operational analytics&lt;/em&gt;. Really, operational analytics is just a marketing term coined to describe the reverse ETL workflow of syncing data between your warehouse and the downstream business tools that marketing teams, sales teams, customer success teams, etc. use to run the company.&lt;/p&gt;

&lt;p&gt;Combining all of these tools enables helpful use cases that weren't possible before. The complete customer profile example is a good one to describe this architecture.&lt;/p&gt;

&lt;p&gt;Building the profiles in the warehouse requires data integration from data sources across the stack. Running ETL/ELT pipelines and collecting behavioral data gives you the full set of user traits in the warehouse. The next logical step is what some call the last mile, sending the full set of characteristics (or a relevant subset) to the operational systems different teams use to run the business.&lt;/p&gt;

&lt;p&gt;This could be marketing automation tools, like Marketo or Braze, sales tools, like Salesforce or Hubspot CRM, customer support tools like Zendesk, business intelligence tools like Tableau or Looker, you get the point.&lt;/p&gt;

&lt;p&gt;By delivering enriched data from the warehouse to all these tools, data teams effectively give business teams superpowers, equipping them with comprehensive customer information in the tools they use every day to do things like prioritize outreach, build better email lists, optimize marketing campaigns, craft better customer experiences, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse ETL makes data engineering easier
&lt;/h2&gt;

&lt;p&gt;Because so much attention is placed on use cases for business users, like the one above, it's easy to overlook one of the biggest benefits of all this modern tooling (and reverse ETL tools in particular): they make data engineering easier.&lt;/p&gt;

&lt;p&gt;Even before the modern data stack, it was technically possible to build a complete customer profile in an operational system like Salesforce, but it was a painful, expensive endeavor that pushed SaaS tools far beyond their intended uses. It also required multiple full time roles (and often consultancies) to maintain the brittle Frankenstein of connectors and customizations. For most companies, the juice wasn't worth the squeeze.&lt;/p&gt;

&lt;p&gt;Thankfully, the modern data stack is making data engineering easier, enabling data teams to focus on more valuable problems, as opposed to building and maintaining low-level data infrastructure just to move data.&lt;/p&gt;

&lt;p&gt;In the next section of this post, I'll give an overview of a real-world data challenge we faced at RudderStack and how we used our own customer data pipelines, including our &lt;a href="https://www.rudderstack.com/product/reverse-etl"&gt;Reverse ETL&lt;/a&gt; solution, and warehouse to solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A classic data challenge: getting richer conversions into ad platforms
&lt;/h2&gt;

&lt;p&gt;Instrumenting pixels and conversions from ad platforms is one of the more frustrating tasks for data teams, especially when you try to implement anything more complex than the most basic use cases. In what can only be described as perverse irony, those more complex use cases are actually what unlock the most value from digital advertising because they give the algorithms better data for optimization.&lt;/p&gt;

&lt;p&gt;One of the most common use cases is sending some sort of signup or lead conversion to a platform like Google Ads. For RudderStack, those conversions are either form fills on the website or account creations in our app. Thankfully, we don't have to deal with all of the direct instrumentation using something like Google Tag Manager because RudderStack makes it easy to map existing &lt;a href="http://rudderstack.com/product/event-stream/"&gt;event stream&lt;/a&gt; identify and track calls (like "User Signed Up") to conversions in Google Ads.&lt;/p&gt;

&lt;p&gt;Sending all signups as conversions is helpful, but our paid team wanted to take it a step further: specifying which conversions represented qualified signups (i.e., MQLs). That sounds simple on the surface, but it's actually a tricky data engineering problem because our MQL definition is both behavioral (did the user perform certain behaviors) as well as firmographic/demographic (company size, industry, job title, etc.). To add even more spice to the challenge, our firmographic/demographic enrichment happens via &lt;a href="https://dashboard.clearbit.com/docs"&gt;Clearbit API&lt;/a&gt; calls, &lt;em&gt;and&lt;/em&gt; the ultimate 'operational home' for the MQL flag on a user is Salesforce.&lt;/p&gt;

&lt;p&gt;Think about building a custom solution just to try to get those enriched conversions into Google Ads. Yikes.&lt;/p&gt;

&lt;p&gt;The good news is that Google provides a data point that opens a pathway to tying all of this together. We won't go into detail, but it is called the "&lt;a href="https://support.google.com/google-ads/answer/9744275"&gt;gclid&lt;/a&gt;," which stands for "Google click ID." This value represents a unique click by a unique user on a particular ad. It's appended to the end of the URL that the user is directed to when they click.&lt;/p&gt;
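
&lt;p&gt;Since the gclid arrives as an ordinary query parameter, pulling it out of a captured page URL is nearly a one-liner. A quick sketch:&lt;/p&gt;

```python
from urllib.parse import urlparse, parse_qs

def extract_gclid(url: str):
    """Pull the gclid query parameter off a landing-page URL, if present."""
    params = parse_qs(urlparse(url).query)
    values = params.get("gclid")
    return values[0] if values else None
```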

&lt;p&gt;Using the gclid and some clever workflows in our stack, we are able to not only send MQL conversions to Google Ads, but also drive much more detailed reporting on our paid advertising performance.&lt;/p&gt;

&lt;p&gt;We will cover the step-by-step details in an upcoming guide, but here's the basic workflow:&lt;/p&gt;

&lt;h3&gt;
  
  
  First, grab the gclid value from the URL property of a page call, and add it to the user profile table in the warehouse
&lt;/h3&gt;

&lt;p&gt;One widespread method of capturing the gclid on a lead record is creating a hidden field on forms or signup flows, grabbing the value from the URL, and populating the hidden field so that, on submission, the user record has a gclid value (in some sort of custom field). The main problem with that method is that many users don't fill out the form on the page they land on (they browse the site for more information), meaning the gclid is lost on subsequent pageviews.&lt;/p&gt;

&lt;p&gt;Mitigating that problem is fraught with peril. Many data engineers resort to the annoying practice of persisting the gclid value in the data layer, often in Google Tag Manager, so that it stays with the user as they browse the site. Not only is this brittle, but it becomes an enormous pain when you have to build similar workflows for other ad platforms. Suffice it to say, this is a huge data engineering headache no matter which way you slice it.&lt;/p&gt;

&lt;p&gt;RudderStack, dbt, and the warehouse make this much, much easier. We already track anonymous pageviews for each unique user, and those page calls include the URL as a property in the event payload. When the user eventually converts, we have the associated track and identify calls in the warehouse, so we can tie all of the previously anonymous pageviews to that now-known user.&lt;/p&gt;

&lt;p&gt;We already run a dbt model that generates enriched user profiles, which includes the MQL flag from Salesforce pulled via a &lt;a href="https://www.rudderstack.com/product/etl"&gt;RudderStack ETL&lt;/a&gt; (extract, transform, load) job. A few simple lines of SQL allow us to grab any previous pageviews associated with MQLs whose URL properties include a gclid value.&lt;/p&gt;
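
&lt;p&gt;The actual logic lives in a few lines of dbt SQL, but the same join can be sketched in Python (the field names here are illustrative, not our real schema): resolve each anonymous pageview to a known user via the identity graph, then collect gclid values for users flagged as MQLs.&lt;/p&gt;

```python
from urllib.parse import urlparse, parse_qs

def gclids_for_mqls(page_events, id_graph, mql_user_ids):
    """Collect gclid values for MQL users from their pageviews.
    page_events: dicts with 'anonymous_id' and 'url' keys.
    id_graph: maps anonymous_id -> resolved user_id (identity stitching).
    mql_user_ids: set of user ids flagged as MQLs."""
    result = {}
    for event in page_events:
        user_id = id_graph.get(event["anonymous_id"])
        if user_id not in mql_user_ids:
            continue  # only MQL conversions are sent to Google Ads
        params = parse_qs(urlparse(event["url"]).query)
        if "gclid" in params:
            result.setdefault(user_id, set()).update(params["gclid"])
    return result
```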

&lt;p&gt;Easy as that, we have the gclid value added to MQL user profiles without all of the mess of trying to persist values, do direct instrumentation with Google Tag Manager, or build a bunch of custom pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Second, send the gclid value to Salesforce via reverse ETL and sync MQLs to Google Ads via direct integration
&lt;/h3&gt;

&lt;p&gt;At this point, a majority of the operational analytics work is done. We just need to complete the loop by getting the enriched MQL conversions back to Google Ads.&lt;/p&gt;

&lt;p&gt;There are a few ways to do this: you could send the enriched conversions directly to Google Ads, but we chose to use the direct integration between Salesforce and Google Ads so that our data engineering team didn't have to maintain additional pipelines.&lt;/p&gt;

&lt;p&gt;So, the only thing left to do was get the gclid value into Salesforce. Amazingly, that was already taken care of by the reverse ETL job we were running to add other enriched data points (like lead score) to lead records in Salesforce. Revenue ops created a custom field, which we mapped in our RudderStack Reverse ETL job, and the data flow was complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  On to more important problems
&lt;/h2&gt;

&lt;p&gt;Our team wired up, tested, and deployed this solution in a few hours. I have battle scars from past wrangling to achieve similar data flows, so it was pretty mind-boggling to see how easily we accomplished this.&lt;/p&gt;

&lt;p&gt;The best part, though, was that it gave our data team time to move on to the bigger, more important problem of collaborating with our analytics team to build better reporting on ad performance.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>eventstreaming</category>
      <category>dataengineering</category>
      <category>customerdataplatform</category>
    </item>
    <item>
      <title>From First-Touch to Multi-Touch Attribution With RudderStack, Dbt, and SageMaker</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 04 Apr 2022 14:32:00 +0000</pubDate>
      <link>https://forem.com/rudderstack/from-first-touch-to-multi-touch-attribution-with-rudderstack-dbt-and-sagemaker-2pcg</link>
      <guid>https://forem.com/rudderstack/from-first-touch-to-multi-touch-attribution-with-rudderstack-dbt-and-sagemaker-2pcg</guid>
      <description>&lt;h2&gt;
  
  
  An overview of the architecture, data and modeling you need to assess contribution to conversion in multi-touch customer journeys
&lt;/h2&gt;

&lt;p&gt;Where should we spend more marketing budget?&lt;/p&gt;

&lt;p&gt;The answer to this age-old budget allocation question boils down to determining which campaigns are working and which aren't, but with the amount of data we collect on the customer journey today, it's not always crystal clear which campaigns are which.&lt;/p&gt;

&lt;p&gt;At first glance, it's fairly straightforward to review metrics in ad platforms like Google Ads to examine ROI on paid ad campaigns, and you may even be sending conversion data back to the platform, but you're still restricted to evaluating only a fraction of the user interactions (one or two steps of the conversion path at best) that led up to and influenced a sale.&lt;/p&gt;

&lt;p&gt;These limitations are due to the fact that many ad platforms were built when basic marketing attribution was enough for most companies. Today, though, data teams and marketers have more data than ever, increasing their appetite for more advanced attribution solutions. Even so, advanced attribution is still hard. But why?&lt;/p&gt;

&lt;p&gt;In short, moving beyond basic single-touch attribution introduces a number of complexities. Instead of binary outcomes (this digital marketing campaign brought someone to our site or it didn't), a user can have many marketing touchpoints, which introduces the idea of influence along a buyer's journey, and that's where things start to get tricky.&lt;/p&gt;

&lt;p&gt;To understand which marketing efforts are contributing more to a successful objective (conversion event) we have to evaluate it relative to all of the other campaigns, and this is very complicated. First, it involves collecting and normalizing an enormous amount of data from a lot of different sources, and second it requires the application of statistical modeling techniques that are typically outside the skillsets of many marketing departments.&lt;/p&gt;

&lt;p&gt;Neither of these marketing measurement challenges should be addressed by the marketing team. The first (comprehensive data collection) is a data engineering problem and the second (statistical modeling) is a data science problem. Because those are not core skills in marketing, most marketers fall back on last-touchpoint models or outsource complex attribution to third-party attribution tools that don't have a complete picture of the attribution data. The problem is, these types of attribution cannot deliver the deep insights necessary for a holistic, cross-channel understanding of marketing performance across the whole consumer journey from the first touchpoint to the last click. Thankfully, modern tooling can solve this problem across teams.&lt;/p&gt;

&lt;p&gt;The challenge: when a user is involved in multiple campaigns across multiple sessions on different devices, how do you know which of these different touchpoints actually influenced the sale?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NmA1mMYb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/2ec5a7b75d1d96728ee5a15dabdb3c5e86cc3751-1999x634.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NmA1mMYb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/2ec5a7b75d1d96728ee5a15dabdb3c5e86cc3751-1999x634.png%3Fauto%3Dformat" alt="attribution flow diagram" width="880" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The RudderStack approach involves building a data set in your warehouse that combines the events (user touches) as well as the metadata (such as the marketing campaign) associated with them. In addition to analyzing campaign performance, we can also use this same data for a variety of machine learning models including lead scoring and likelihood to repeat a purchase.&lt;/p&gt;

&lt;p&gt;In this article we will walk through how we recently helped an e-commerce customer realign their marketing budget using a combination of different attribution models. We will start with a high-level architecture review and how they use RudderStack to collect all of the data (including marketing spend!) to create a complete story of the user journey in their own data warehouse. Next, we will show you how they used dbt and RudderStack Reverse ETL to prepare the data for modeling. In this example, AWS SageMaker was used to run the Jupyter Notebook, and we will walk you through how the results of multiple models are sent back to the warehouse for reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: from comprehensive data collection to using multi-touch attribution models
&lt;/h2&gt;

&lt;p&gt;Here is the basic end-to-end flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Stream behavioral events from various platforms (web, mobile, etc.) into the warehouse&lt;/li&gt;
&lt;li&gt; ETL additional data sets to complete the user journey data set in the warehouse (sales emails, app volume usage, inventory, etc.)&lt;/li&gt;
&lt;li&gt; Create enriched user journeys via RudderStack identity stitching&lt;/li&gt;
&lt;li&gt; Define conversion and user features using dbt&lt;/li&gt;
&lt;li&gt; Reverse-ETL user features to S3&lt;/li&gt;
&lt;li&gt; Run python models on the S3 data from a Jupyter Notebook in SageMaker and output results back to S3&lt;/li&gt;
&lt;li&gt; Lambda function streams new result records from S3 to RudderStack and routes to the warehouse and downstream tools&lt;/li&gt;
&lt;/ol&gt;
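
&lt;p&gt;Step 7 above is mostly payload mapping. As a rough sketch (the field names are hypothetical, and the HTTP delivery is omitted), the Lambda turns each model result row read from S3 into a track-style event before forwarding it:&lt;/p&gt;

```python
import json

def model_result_to_track(record: dict) -> dict:
    """Map one attribution-model result row onto a RudderStack-style
    `track` payload. Field names are illustrative, not the actual schema."""
    return {
        "userId": record["user_id"],
        "event": "Attribution Score Computed",
        "properties": {
            "model": record["model"],    # e.g. "first_touch"
            "channel": record["channel"],
            "credit": record["credit"],  # fractional credit assigned
        },
    }

# The Lambda would then POST json.dumps(payload) to the RudderStack source.
```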

&lt;p&gt;The end result is that the enriched user journey data produces a data flow and feature set that can feed multiple &lt;em&gt;different&lt;/em&gt; attribution models as outputs, from simple to more complex. This is important because downstream teams often need different views of attribution to answer different questions. On the simple end of the spectrum, knowing how people initially enter the journey, their first touch, is very helpful for understanding which channels drive initial interest, while a last-touch model shows which conversions finally turn visitors into users or customers.&lt;/p&gt;

&lt;p&gt;The most important data, however, often lives &lt;em&gt;in between&lt;/em&gt; first touch and last touch. In fact, even in our own research on the journey of RudderStack users, we commonly see 10+ total touch points before conversion. Understanding the touchpoints that happen after the first touch that influence the last touch can reveal really powerful insights for marketing and product teams, especially if those touchpoints cost money (in the form of paid campaigns).&lt;/p&gt;

&lt;p&gt;Let's dig into an overview of the workflow for this use case. Here's what we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A quick explanation of how data engineers can collect every touchpoint from the stack (without the pain)&lt;/li&gt;
&lt;li&gt;  An overview of how to build basic first touch and last touch attribution&lt;/li&gt;
&lt;li&gt;  An explanation of why it's valuable to apply additional statistical models for multi-touch attribution and an overview of how feature building in dbt fits into the architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The data engineering challenge: capturing every touchpoint
&lt;/h2&gt;

&lt;p&gt;Capturing the entire "user journey" is such a common use case, both for us and our customers, that our teams often take the term for granted. When we talk about user journeys, what we really mean is, "in chronological order, tell me every single touchpoint where a particular user was exposed to our business, whether that be on our website, mobile app, email, etc., and also include metadata, such as UTM params and referring URLs, that might provide context or insight about that particular touchpoint."&lt;/p&gt;

&lt;p&gt;But where does all of that data come from?   The answer is that it comes from all over your stack, which explains why it's a data engineering challenge. For behavioral event data, our customer uses &lt;a href="https://www.rudderstack.com/product/event-stream/"&gt;RudderStack Event Stream&lt;/a&gt; SDKs to capture a raw data feed of how users are interacting with their website and mobile app (&lt;a href="https://www.rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/"&gt;we have 16 SDKs&lt;/a&gt;, from JavaScript to mobile and server-side and even gaming frameworks).&lt;/p&gt;

&lt;p&gt;Behavioral data is only one piece of the puzzle, though. This customer also captured touchpoints from cloud apps in their stack. For that, they leverage RudderStack ETL sources to ingest application data from their CRM and marketing automation tools. Lastly, they use RudderStack's &lt;a href="https://www.rudderstack.com/docs/rudderstack-api/http-api/"&gt;HTTP&lt;/a&gt; and &lt;a href="https://www.rudderstack.com/docs/stream-sources/webhook-source/"&gt;Webhook&lt;/a&gt; sources for ingesting data from proprietary internal systems (those sources accept data from anything that will send a payload).&lt;/p&gt;

&lt;p&gt;It's worth noting that RudderStack's SDKs handle identity resolution for anonymous and known users across devices and subdomains. This allowed our customer to use dbt to combine data from cloud apps, legacy data and user event data as part of their identity stitching framework to achieve the coveted 360 view of the customer.&lt;/p&gt;

&lt;p&gt;Solving the data engineering challenge is really powerful stuff when it comes to attribution, in large part because data is the entire foundation of good modeling. This is also why we believe in building your customer data stack in the warehouse, where you are collecting all of your data anyways.&lt;/p&gt;

&lt;p&gt;Our customer told us that one major advantage of having a flexible stack is that, unlike traditional first touch and last touch analysis in GA or Adobe Analytics, building a solution on the warehouse allowed them to factor in the effect (and cost) of coupons and other discounts applied at checkout (via server-side events) and treat them as alternative forms of paid marketing spend. Additionally, having data from sales and marketing automation tools truly completed the picture for them because they could evaluate the contribution of "offline" activity such as emails opened, even if the recipient didn't click on any links that directed them back to the website. Both of these use cases were impossible for them with third party analytics tools and siloed data.&lt;/p&gt;

&lt;p&gt;So at this point, our customer had all of the data in their warehouse and had stitched together customer profiles and journeys using dbt. Then what?&lt;/p&gt;

&lt;p&gt;After building user journeys with RudderStack and dbt, they had the foundation for creating robust data models for statistical analysis and ML modeling. For their particular e-commerce use case, we helped them create a model that combined both the touchpoints and the marketing campaigns associated with those touchpoints to create a multipurpose dataset for use in SageMaker. Here is a sampling of some of the event types and sources used:&lt;/p&gt;

&lt;p&gt;List of Channels &amp;amp; RudderStack Event Types&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Paid Channel (Event Name)&lt;/th&gt;
&lt;th&gt;RudderStack Event Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google - Paid Display (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google - Paid Search (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Nurture Newsletter (email_opened)&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Abandoned Cart (email_opened)&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twitter - Post Organic (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Facebook - Display Image Carousel (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Retargeting (email_opened)&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Braze - SMS Abandoned Cart (sms_sent)&lt;/td&gt;
&lt;td&gt;Track&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TikTok - Display (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Youtube - Video (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In App Messaging - Coupon Offer (coupon_applied)&lt;/td&gt;
&lt;td&gt;Server-Side Track&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instagram - Shopping (offline_purchase)&lt;/td&gt;
&lt;td&gt;Existing Data In Warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google - Shopping (site_visit)&lt;/td&gt;
&lt;td&gt;Page&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once we had established the touchpoints we needed in the dataset, the next step was defining the conversion, a key component of attribution models. This customer's direct-to-consumer e-commerce use case defined conversion as a customer's total purchases exceeding a certain dollar threshold (many companies would consider these high-value or loyal customers).&lt;/p&gt;

&lt;p&gt;It's important to note that this can comprise multiple purchases over any time period on a user level, which is impossible to derive in traditional analytics tools because it requires aggregating one or more transactions &lt;em&gt;and&lt;/em&gt; one or more behavioral events &lt;em&gt;and&lt;/em&gt; tying that to a known user. In the data warehouse, though, the customer had all of that data and could easily manage timelines because each touchpoint had a timestamp.&lt;/p&gt;
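
&lt;p&gt;Deriving that conversion point is a simple running sum over timestamped purchases. Here's a sketch of the logic (the real computation happens in the dbt model):&lt;/p&gt;

```python
def conversion_time(purchases, threshold):
    """Return the timestamp at which a user's cumulative purchase total
    first crosses `threshold`, or None if it never does.
    `purchases` is a list of (timestamp, amount) tuples in chronological
    order; the result corresponds to the conversion_time column."""
    total = 0.0
    for ts, amount in purchases:
        total += amount
        if total >= threshold:
            return ts
    return None
```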

&lt;p&gt;Using RudderStack and dbt, we helped them build a dbt model that outputs a single table of user touches with associated campaigns and flags each user, with a timestamp, indicating whether or not the user eventually converted. UTM parameters were extracted from page calls and woven together with applicable track calls, such as "Abandoned Cart Email Coupon Sent," and then combined with other data from ETL sources in the warehouse, as outlined in the table below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;USERID&lt;/td&gt;
&lt;td&gt;VARCHAR(16777216)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;EVENT_CATEGORY&lt;/td&gt;
&lt;td&gt;VARCHAR(9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;EVENT&lt;/td&gt;
&lt;td&gt;VARCHAR(16777216)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;CHANNEL&lt;/td&gt;
&lt;td&gt;VARCHAR(16777216)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;TIMESTAMP_LTZ(9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;CONVERSION_TIME&lt;/td&gt;
&lt;td&gt;TIMESTAMP_LTZ(9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;ROW_ID&lt;/td&gt;
&lt;td&gt;VARCHAR(16777216)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output table was designed to serve a variety of different statistical and ML models beyond this use case and includes the following columns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; userId: RudderStack user identifier from the various events tables. In our case we use the Rudder ID created by the identity stitching model. For B2B applications, this could be AccountID, Org_ID, etc.&lt;/li&gt;
&lt;li&gt; event_category: The type of event being sourced. Not used in this analysis, but it may be useful for filtering or other ML modeling.&lt;/li&gt;
&lt;li&gt; event: The name of the specific event that generated the touch. Again, this field is not used in our attribution modeling but will be used in other models.&lt;/li&gt;
&lt;li&gt; channel: The marketing channel attributed to this particular event. As we will see in the dbt model, this could be driven by a UTM parameter on a page event, or it may be extrapolated from the source itself (i.e., Braze SMS messages, email opens from Customer.io, server-side events, or shipping data already in the warehouse).&lt;/li&gt;
&lt;li&gt; timestamp: This will typically be the timestamp on the event itself, but it could be any timestamp indicating when this particular touch occurred.&lt;/li&gt;
&lt;li&gt; conversion_time: The timestamp of when the user first reached their qualifying order total. This is computed in a separate step within the dbt model and applied to all events for that particular userId. If the user has not completed the checkout process, this will be null. It is important to note that we do not want any events for a user after the time that user converts.&lt;/li&gt;
&lt;li&gt; row_id: The sequence identifier for each userId. This is used by RudderStack Reverse ETL to support daily incremental loads of new events.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the data set created in the warehouse, the customer connected RudderStack's Reverse-ETL pipeline to send the table to S3, where the attribution modeling was executed in SageMaker and Jupyter Notebooks. They then used a Lambda function to send the results back through RudderStack and into the warehouse, where the team could begin interpreting results. (Keep your eyes peeled for a deep dive into that workflow in an upcoming post.)&lt;/p&gt;

&lt;p&gt;Here's a visual diagram of the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fcLQO0wL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/f3415af2899349b9185f80d1fae3d138020aa870-1542x1183.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fcLQO0wL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/f3415af2899349b9185f80d1fae3d138020aa870-1542x1183.png%3Fauto%3Dformat" alt="data architecture" width="880" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting with first and last touch attribution
&lt;/h2&gt;

&lt;p&gt;As we said above, this customer wanted to evaluate various attribution models to answer different questions about their marketing spend and the customer journey.&lt;/p&gt;

&lt;p&gt;They started simple, with basic first touch and last touch models. Because every touchpoint is timestamped, it was fairly simple to extract first/last touch and first/last campaign attribution. This customer in particular was interested in comparing attribution across first and last touch models, which was easy to do within the same report or SQL query. Interestingly, they said this was incredibly valuable because a similar comparative analysis couldn't be performed in Google Analytics or by exporting last touch attribution from a CRM into a spreadsheet.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with last touch attribution
&lt;/h3&gt;

&lt;p&gt;Last touch attribution is the most common way to assign credit for a conversion, largely because it's the easiest kind of attribution to track, especially in tools that were never designed for attribution (custom fields in Salesforce, anyone?). For the sake of clarity, a last touch attribution model assigns all of the credit to the final touch. So, if a conversion is valued at $x and the user interacted with four different campaigns before converting, the final campaign gets the full $x of credit while the three previous campaigns get nothing.&lt;/p&gt;

&lt;p&gt;This becomes a major problem when the campaigns' goals are different. For example, some campaigns may aim for brand awareness, which almost always means lower conversion rates. When brand campaigns do convert, it usually happens over a much longer period of time, even after the campaign has ended, or as an 'assist' that brings the user back in prior to conversion. So, even if certain brand campaigns are extremely influential on eventual conversion, last touch attribution models don't afford them the credit they deserve. This is particularly important when marketing teams are trying to optimize the balance between spend on brand campaigns vs conversion campaigns.&lt;/p&gt;

&lt;p&gt;We see this scenario across all of our customers, be they B2B or B2C, and the larger the sale, the flatter the tail tends to be. The chart below shows a typical days-to-conversion distribution and highlights how last touch models can grossly overstate the significance of the final campaign.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MXqSsTH0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/a4e9d6963c5837d5f2769c3079f4ee6b0f926492-418x283.png%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MXqSsTH0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/a4e9d6963c5837d5f2769c3079f4ee6b0f926492-418x283.png%3Fauto%3Dformat" alt="days to conversion" width="418" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Better options through statistical modeling
&lt;/h3&gt;

&lt;p&gt;Given the complexity of today's marketing environments and the limitations of last touch modeling, we must consider more sophisticated alternatives that assign appropriate credit to each campaign and consider the full path up to the point of conversion, which is exactly what our customer set out to do.&lt;/p&gt;

&lt;p&gt;The problem of attributing a user's conversion to all touches throughout the journey is called Multi-Touch Attribution (MTA). It can be implemented with various rule-based approaches, for example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Linear Attribution Model: This approach gives equal credit to all touches.&lt;/li&gt;
&lt;li&gt; Time-Decay Model: More recent touches are weighted more and the longer ago a touch occurred, the less weight it receives. &lt;/li&gt;
&lt;li&gt; U-Shape Attribution Model: The first and last touches get the most credit, while intermediate touches share the remainder.&lt;/li&gt;
&lt;/ol&gt;
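&lt;p&gt;As an illustration, the three rules above (plus their shared normalization) can be sketched in a few lines of Python. The specific weights and half-life are illustrative choices, not a prescribed implementation:&lt;/p&gt;

```python
def attribute(touches, model="linear", half_life=7.0):
    """Assign fractional conversion credit to an ordered list of touches.

    touches: list of (channel, days_before_conversion) pairs, oldest first.
    Returns {channel: credit} summing to 1.0. Illustrative only.
    """
    n = len(touches)
    if model == "linear":
        weights = [1.0] * n  # equal credit to every touch
    elif model == "time_decay":
        # exponential decay: a touch loses half its weight every `half_life` days
        weights = [0.5 ** (days / half_life) for _, days in touches]
    elif model == "u_shape":
        # 40% first, 40% last, remaining 20% split among middle touches
        if n == 1:
            weights = [1.0]
        elif n == 2:
            weights = [0.5, 0.5]
        else:
            weights = [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    else:
        raise ValueError("unknown model: " + model)
    total = sum(weights)
    credit = {}
    for (channel, _), w in zip(touches, weights):
        credit[channel] = credit.get(channel, 0.0) + w / total
    return credit

journey = [("paid_search", 20), ("email", 5), ("sms", 1)]
print(attribute(journey, "linear"))  # each touch gets 1/3
```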

&lt;p&gt;These rules are all heuristics and can be arbitrary. At RudderStack we recommend a more data-driven approach, and we routinely employ two established methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Shapley values are derived from Game Theory, and they essentially capture the marginal contribution from each touch towards the ultimate conversion.&lt;/li&gt;
&lt;li&gt; Markov chain based values capture the probabilistic nature of user journeys and the removal effect of each touchpoint. They also highlight the critical touches in the journey: points where, if something goes wrong, the conversion probability is significantly reduced.&lt;/li&gt;
&lt;/ol&gt;
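&lt;p&gt;To make the Shapley idea concrete, here's a toy sketch that averages each channel's marginal contribution to a conversion rate over all orderings. The coalition conversion rates below are made up for illustration; production implementations estimate them from journey data and use approximations rather than brute force, since exact computation is exponential in the number of channels:&lt;/p&gt;

```python
from itertools import permutations

def shapley_credits(channels, conv_rate):
    """Exact Shapley values for a toy conversion-rate function.

    conv_rate maps a frozenset of channels to an observed conversion rate;
    each channel's marginal contribution is averaged over all orderings.
    Only viable for small channel sets (factorial blowup).
    """
    credit = {c: 0.0 for c in channels}
    orderings = list(permutations(channels))
    for order in orderings:
        seen = frozenset()
        for c in order:
            # marginal contribution of adding channel c to the coalition
            credit[c] += conv_rate(seen | {c}) - conv_rate(seen)
            seen = seen | {c}
    return {c: v / len(orderings) for c, v in credit.items()}

# Hypothetical coalition conversion rates: email and SMS together convert best
rates = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.02,
    frozenset({"sms"}): 0.01,
    frozenset({"email", "sms"}): 0.05,
}
print(shapley_credits(["email", "sms"], lambda s: rates[s]))
```

&lt;p&gt;A useful property of Shapley values is that the credits always sum to the value of the full coalition, so every channel's share is directly comparable.&lt;/p&gt;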

&lt;p&gt;Here's how the results look across the three models (Fig 3):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Paid Channel (Event Name)&lt;/th&gt;
&lt;th&gt;Last Touch&lt;/th&gt;
&lt;th&gt;Markov&lt;/th&gt;
&lt;th&gt;Shapley Values&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google - Paid Display (site_visit)&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;4.1%&lt;/td&gt;
&lt;td&gt;2.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google - Paid Search (site_visit)&lt;/td&gt;
&lt;td&gt;15.0%&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Nurture Newsletter (email_opened)&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;6.9%&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Abandoned Cart (email_opened)&lt;/td&gt;
&lt;td&gt;20.0%&lt;/td&gt;
&lt;td&gt;9.1%&lt;/td&gt;
&lt;td&gt;15.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twitter - Post Organic (site_visit)&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;td&gt;11.3%&lt;/td&gt;
&lt;td&gt;3.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Facebook - Display Image Carousel (site_visit)&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;td&gt;4.8%&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email - Retargeting (email_opened)&lt;/td&gt;
&lt;td&gt;7.0%&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;td&gt;3.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Braze - SMS Abandoned Cart (sms_sent)&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;4.6%&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TikTok - Display (site_visit)&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;td&gt;7.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Youtube - Video (site_visit)&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;9.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In App Messaging - Coupon Offer (coupon_applied)&lt;/td&gt;
&lt;td&gt;12.0%&lt;/td&gt;
&lt;td&gt;13.0%&lt;/td&gt;
&lt;td&gt;8.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instagram - Shopping (offline_purchase)&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;td&gt;4.3%&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google - Shopping (site_visit)&lt;/td&gt;
&lt;td&gt;10.0%&lt;/td&gt;
&lt;td&gt;9.2%&lt;/td&gt;
&lt;td&gt;1.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Helpful insights in comparing models
&lt;/h2&gt;

&lt;p&gt;When our customer evaluated the returned results in their warehouse, they uncovered some pretty interesting and helpful insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Last touch based attribution gives a very high weight to abandoned cart emails.  Anecdotally this makes sense as users are likely enticed by a coupon for something they've already considered purchasing and this is the last activity they engage with prior to purchasing.  On the other hand, both the Markov and Shapley values suggest that while this may occur just before a conversion, its marginal contribution is not as significant as the last touch model would suggest (remember, the key conversion is total purchases above some value). Instead of continuing to invest in complex abandoned cart follow-up email flows, the customer focused on A/B testing abandoned cart email messaging as well as testing recommendations for related products.&lt;/li&gt;
&lt;li&gt; In the last touch model, Instagram purchases don't look like a compelling touchpoint. This alone was valuable data, because Instagram purchase data is siloed and connecting activity across marketplaces is complicated. Again, using the warehouse helped break those silos down for our customer. Interestingly, even though last touch contribution was very low, it was clear from the Shapley values that Instagram purchases were a major influence on the journey for high value customers. So, in what would have previously been a counter-intuitive decision, the customer increased marketing spend to drive purchases on Instagram and included content that drove customers back to their primary eCommerce site.&lt;/li&gt;
&lt;li&gt; Markov values for Twitter (organic) posts are much higher compared to the Shapley values. This showed our customer that not many people actually make a significant purchase based on Twitter posts, but when they do, they have very high intent. The customer reallocated spend from Google, which was overrated in the last touch model, and invested in promoting organic Twitter posts and creating new kinds of organic content.&lt;/li&gt;
&lt;li&gt; The Facebook Display campaign has a high Shapley value but low Markov value, which indicates a high dropoff rate from people after seeing ads on Facebook. Based on this insight, the customer moved budget from Facebook to TikTok and YouTube, both of which had far less dropoff.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The only way to truly understand which campaigns are working is to have full insight into every touchpoint of a user's journey. RudderStack eliminates the engineering headaches involved in collecting and unifying all of your data feeds and reduces the build time for your data scientists with its uniform schema design and identity stitching. If you would like to learn more about how RudderStack can help address your company's engineering or data science needs, &lt;a href="https://app.rudderstack.com/signup?type=freetrial"&gt;sign up free&lt;/a&gt; today and &lt;a href="https://www.rudderstack.com/join-rudderstack-slack-community/"&gt;join us on Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dbt</category>
      <category>datascience</category>
      <category>sagemaker</category>
    </item>
    <item>
      <title>RudderStack Product News - Vol. #020: SDK Event Filtering, Beacon Support for JS, and OneTrust for JS</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Tue, 22 Feb 2022 07:46:16 +0000</pubDate>
      <link>https://forem.com/rudderstack/rudderstack-product-news-vol-020-sdk-event-filtering-beacon-support-for-js-and-onetrust-for-js-1ic4</link>
      <guid>https://forem.com/rudderstack/rudderstack-product-news-vol-020-sdk-event-filtering-beacon-support-for-js-and-onetrust-for-js-1ic4</guid>
      <description>&lt;h2&gt;
  
  
  SDK Features: Client-side Event Filtering, Beacon Support for JS, and OneTrust for JS
&lt;/h2&gt;

&lt;p&gt;We're excited to announce a series of new features for our SDKs. Our &lt;a href="https://www.rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/event-filtering"&gt;SDK event filtering&lt;/a&gt; lets users specify which events should be discarded or allowed to flow through on the client.&lt;/p&gt;

&lt;p&gt;In addition to this, &lt;a href="https://www.rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-javascript-sdk/javascript-sdk-enhancements/#sendbeacon-as-a-transport-mechanism"&gt;Beacon Support&lt;/a&gt; is now available for JS. You can now use sendBeacon with the latest version of our JS SDK, which provides improved performance via an asynchronous process.&lt;/p&gt;

&lt;p&gt;We now support &lt;a href="https://www.rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-javascript-sdk/consent-managers/"&gt;OneTrust&lt;/a&gt; as a consent manager for JS SDK. This consent manager offers various cookie consent solutions that allow customers to determine what personal data they are willing to share with a business.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.rudderstack.com/login"&gt;Try Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Airflow for Reverse ETL
&lt;/h2&gt;

&lt;p&gt;RudderStack's new Airflow Provider lets you schedule and trigger your Reverse ETL jobs from outside RudderStack and integrate them with your existing Airflow workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rudderstack.com/docs/warehouse-actions/airflow-provider/#using-the-rudderstack-airflow-provider"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Volume for Sources
&lt;/h2&gt;

&lt;p&gt;The sources page now has a new Events tab that includes information on event volume over time. Volumes are also calculated on a per-event basis so you can understand the footprint of different events in your tracking program. Log into our product to see the latest changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.rudderstack.com/signup?type=freetrial"&gt;Learn More&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Email Notifications on Warehouse Sync Failures
&lt;/h2&gt;

&lt;p&gt;We now send out email notifications on warehouse sync failures so you can act on them immediately. Our goal is to reduce silent failures across the ecosystem and expand our notification capabilities over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.rudderstack.com/signup?type=freetrial"&gt;Try Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake Destination
&lt;/h2&gt;

&lt;p&gt;RudderStack now supports Delta Lake as a destination. Delta Lake is an open format storage layer that delivers reliability, security and performance on your data lake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rudderstack.com/docs/data-warehouse-integrations/delta-lake/#databricks-delta-lake"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;p&gt;Statsig Cloud Mode Integration&lt;/p&gt;

&lt;p&gt;RudderStack supports Statsig as a destination to which people can send real-time event data for efficient A/B testing. Statsig helps companies safely A/B test features in production before rolling them out, avoiding product debates and cost issues when shipping new features.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://www.rudderstack.com/docs/destinations/testing-and-personalization/statsig/"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  LaunchDarkly Web Device Mode Integration
&lt;/h3&gt;

&lt;p&gt;LaunchDarkly is a popular feature management platform that offers high quality A/B testing and experimentation functionalities. RudderStack supports LaunchDarkly as a destination to which customers can seamlessly send their data for flag management.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://www.rudderstack.com/docs/destinations/testing-and-personalization/launchdarkly/"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  TikTok Ads Cloud Mode Integration
&lt;/h3&gt;

&lt;p&gt;TikTok Ads is TikTok's online advertising platform. We now support TikTok Ads Cloud Mode Integration on web, mobile, and server.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://www.rudderstack.com/docs/destinations/advertising/tiktok-ads/#tiktok-ads"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Candu Cloud Mode Integration
&lt;/h3&gt;

&lt;p&gt;Candu is a product experience platform that provides no-code web tools for SaaS applications. RudderStack now supports Candu Cloud Mode on web, mobile, and server.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://www.rudderstack.com/docs/destinations/testing-and-personalization/candu/#getting-started"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other happenings at RudderStack
&lt;/h2&gt;

&lt;p&gt;🎧  The Data Stack Show: What a High Performing Data Team (and Stack) Looks Like&lt;/p&gt;

&lt;p&gt;Check out this episode of The Data Stack Show to hear Paige Berry from Netlify talk about how they organize their data team for success.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://datastackshow.com/podcast/what-a-high-performing-data-team-and-stack-looks-like-with-paige-berry-of-netlify/"&gt;Listen&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📄  On the Blog: What is a Customer Data Platform?&lt;/p&gt;

&lt;p&gt;Read this post from our Director of Product, Brian Lu, to learn about foundational principles of the customer data platform and get a few CDP evaluation tips.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rudderstack.com/blog/what-is-a-customer-data-platform"&gt;Read more&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🖥  DC_Thurs: Interview with RudderStack's Founder and CEO&lt;/p&gt;

&lt;p&gt;In this episode of DC_Thurs from Data Council, RudderStack Founder and CEO, Soumyadeb Mitra, chats with Pete Soderling about how uses for the customer data platform extend well beyond marketing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=CjQNcmZe3Bc"&gt;Watch now&lt;/a&gt;&lt;/p&gt;

</description>
      <category>eventfiltering</category>
      <category>javascript</category>
      <category>cdp</category>
      <category>customerdata</category>
    </item>
    <item>
      <title>Enabling the Customer Data Stack: RudderStack Series B Funding</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 14 Feb 2022 09:42:55 +0000</pubDate>
      <link>https://forem.com/rudderstack/enabling-the-customer-data-stack-rudderstack-series-b-funding-3h81</link>
      <guid>https://forem.com/rudderstack/enabling-the-customer-data-stack-rudderstack-series-b-funding-3h81</guid>
      <description>&lt;p&gt;Today, I'm thrilled to announce RudderStack's $56 million Series B funding led by Insight Ventures with continued support from Kleiner Perkins and S28 Capital. This brings our total funding to $82 million. Our Series B is a big step forward in helping our customers build the best data stacks possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vWXloddb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/6b0d64bbd94878dc68c02d7be6c9cad6f7ade7f0-1706x865.jpg%3Fauto%3Dformat" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vWXloddb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/6b0d64bbd94878dc68c02d7be6c9cad6f7ade7f0-1706x865.jpg%3Fauto%3Dformat" alt="RudderStack customer data platform UI" width="880" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 2019, I co-founded RudderStack with the goal to help data engineers build the infrastructure required to help their businesses understand their users and serve their needs. I was inspired by my experience at my previous company, where I spent a year building customer data pipelines. In the process, I learned about many of the challenges that both myself and my peers were facing as we were collecting and processing customer data at enterprise scale. This funding round is a testament to the progress we have made in addressing these challenges for our customers.&lt;/p&gt;

&lt;p&gt;With this funding round, we intend to accelerate investments in our product to enable engineers everywhere to build future-proof customer data infrastructure.&lt;/p&gt;

&lt;p&gt;I am proud of what this team has accomplished in just a few short years and am looking forward to what the future holds.&lt;/p&gt;

&lt;p&gt;Read the &lt;a href="https://techcrunch.com/2022/02/02/rudderstack-raises-56m-for-its-customer-data-platform/"&gt;TechCrunch article&lt;/a&gt; featuring our Series B to learn more.&lt;/p&gt;

</description>
      <category>seriesb</category>
      <category>cdp</category>
      <category>rudderstack</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>What Is a Customer Data Platform?</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Thu, 10 Feb 2022 13:19:34 +0000</pubDate>
      <link>https://forem.com/rudderstack/what-is-a-customer-data-platform-28a9</link>
      <guid>https://forem.com/rudderstack/what-is-a-customer-data-platform-28a9</guid>
      <description>&lt;p&gt;The goal of a customer data platform (CDP) is to (1) build a unified view of your customers and (2) utilize these profiles to enrich the quality of customer interactions across multiple touchpoints. With a "build once, use everywhere" mentality, organizations can use a CDP to make sure their customer interactions are personalized, consistent across touchpoints, and valuable for all parties involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does a customer data platform do?
&lt;/h2&gt;

&lt;p&gt;Customers interact with an organization through marketing, sales, support, and the product itself. CDPs help collect and aggregate this information, build customer profiles, and propagate this information to downstream systems. Operational systems take advantage of these profiles to deliver personalized messaging, contextualized interactions, and custom experiences. To achieve this, there are several major things a CDP does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Bring in the data. Customer Data Platforms have integrations with a variety of first-party data sources that capture customer information in relation to an organization. This includes marketing, sales, support, and the product itself.&lt;/li&gt;
&lt;li&gt; Build the profiles. Customer Data Platforms build unified customer profiles (oftentimes called "Customer 360") that are a combination of static traits as well as aggregate behavior information.&lt;/li&gt;
&lt;li&gt; Make the profiles useful. Once the profiles are built, they are utilized by internal apps, sales/marketing/support vendors, and other systems to take advantage of the data to personalize communications or contextualize interactions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Bringing in the data
&lt;/h3&gt;

&lt;p&gt;Since customer interactions are captured in a variety of tools and software, there are many sources from which a CDP needs to collect data.&lt;/p&gt;

&lt;p&gt;This includes information such as data used within the product, metadata (e.g. geography), and behavioral data (product usage, marketing responsiveness, sales/support communications). This data is often referred to as "first party data" because it is the customer's interaction with the organization itself rather than information bought from outside parties.&lt;/p&gt;

&lt;p&gt;There are several ways of bringing in data, many of which are understood in the context of the modern data stack. These are the most common examples.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Application Event Tracking. Organizations can leverage SDKs across a variety of languages that enable them to track and understand user behavior and flows throughout a software application. This can be done either on the client (Mobile, Web, Smart Device) or on the server (Python, Java, Go, etc.) with trade-offs for each.&lt;/li&gt;
&lt;li&gt;  Vendor Webhooks. Some software vendors expose webhooks on activity that happens on their platform. These data streams are private to the organization using the vendor and provide information on how the vendors are used.&lt;/li&gt;
&lt;li&gt;  Production Databases. Production databases are a good source of truth for application transactions or production data models. These data streams are often brought in via ETL or ELT (extract, transform, load / extract, load, transform) processes that leverage modern techniques like change data capture (CDC).&lt;/li&gt;
&lt;li&gt;  Vendor Data Extracts. Vendors that directly engage with customers on behalf of an organization often hold the data within their systems. This data can be extracted in batch through vendor provided APIs.&lt;/li&gt;
&lt;/ul&gt;
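&lt;p&gt;Conceptually, whichever SDK or source is used, each ingestion method ultimately produces an event payload. Here's a rough plain-Python sketch of what a &lt;code&gt;track&lt;/code&gt;-style event looks like on the wire (field names are illustrative; consult your SDK's documentation for the exact spec):&lt;/p&gt;

```python
import json
import uuid
from datetime import datetime, timezone

def track(user_id, event, properties):
    """Build a track-style event payload (illustrative shape only)."""
    return {
        "type": "track",
        "userId": user_id,
        "event": event,
        "properties": properties,
        "messageId": str(uuid.uuid4()),  # dedupe key for at-least-once delivery
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = track("u-42", "Order Completed", {"total": 99.5, "currency": "USD"})
print(json.dumps(payload, indent=2))
```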

&lt;h3&gt;
  
  
  Building the profiles
&lt;/h3&gt;

&lt;p&gt;Bringing in customer data enables you to build unified, comprehensive customer profiles based on every touchpoint with your organization. Given the data that is collected through upstream integrations, a customer data platform builds a single view of the user that can be understood, filtered, aggregated, or propagated, depending on the use case.&lt;/p&gt;

&lt;p&gt;Advanced features in building customer profiles include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Personalization. Machine learning models or simple heuristics can help predict what customers are interested in based on existing interaction. This includes personalized recommendations, personalized messaging, or cohort-specific customer journeys. Here's how one of our customers leverages &lt;a href="https://www.rudderstack.com/blog/real-time-personalization-with-redis-and-rudderstack"&gt;RudderStack and Redis for real-time personalization&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Customer Health. Customer health is based on a combination of engagement (or lack thereof), quality of interaction, direct feedback, and more. This information can help guide messaging that is encouraging, conciliatory, determined or exploratory depending on how the customer feels about the company/product.&lt;/li&gt;
&lt;li&gt;  Identity resolution. Customer interactions across many tools often share an identifier, but pseudo-anonymous interactions (e.g. pre-login) benefit from being stitched back together to customers once you determine who they were. Learn more about &lt;a href="https://www.rudderstack.com/blog/the-tale-of-identity-graph-and-identity-resolution/"&gt;identity resolution&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
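&lt;p&gt;At its core, identity resolution is a graph problem: identifiers observed together get linked, and connected identifiers collapse into a single profile. Here's a minimal union-find sketch of that stitching logic (identifier names are hypothetical; real systems add confidence scoring and merge rules on top):&lt;/p&gt;

```python
def stitch_identities(edges):
    """Minimal union-find identity graph.

    Each edge links two identifiers (e.g. an anonymousId observed alongside
    a userId at login). Returns a mapping from every identifier to a
    canonical representative for its connected component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    return {node: find(node) for node in parent}

links = [("anon-1", "user-7"), ("anon-2", "user-7"), ("anon-3", "user-9")]
canonical = stitch_identities(links)
# anon-1 and anon-2 resolve to the same profile; anon-3 does not
```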

&lt;h3&gt;
  
  
  Making the profiles useful
&lt;/h3&gt;

&lt;p&gt;Activating data on a customer data platform involves making it available to downstream communication platforms, whether it be sales, marketing, support, or the product itself. These are often the same tools used to capture these interactions from the "Bring in the Data" section above.&lt;/p&gt;

&lt;p&gt;This can involve either pushing data to the downstream systems or exposing an interface (e.g. an API endpoint) that enables vendors to access and aggregate customer data. If the data lives in a cloud data warehouse, there may be technological overlap with products in the &lt;a href="https://www.rudderstack.com/product/reverse-etl"&gt;Reverse ETL&lt;/a&gt; category, which help push data from warehouses to downstream integrations.&lt;/p&gt;

&lt;p&gt;In addition to using data for individual customers, CDPs enable organizations to build audience lists based on shared profile attributes. For example, if you want to target all customers from Europe who joined in the past 5 days, a CDP lets you leverage the central customer profile to build audience lists that can be used in communication platforms.&lt;/p&gt;
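&lt;p&gt;That audience example can be sketched as a simple filter over unified profiles (field names are hypothetical; real CDPs typically express this as a warehouse query or a visual audience builder):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def build_audience(profiles, region, joined_within_days):
    """Select user IDs matching a region and a recent join date,
    e.g. customers from Europe who joined in the past 5 days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=joined_within_days)
    return [p["user_id"] for p in profiles
            if p["region"] == region and p["joined_at"] >= cutoff]

now = datetime.now(timezone.utc)
profiles = [
    {"user_id": "u1", "region": "Europe", "joined_at": now - timedelta(days=1)},
    {"user_id": "u2", "region": "Europe", "joined_at": now - timedelta(days=30)},
    {"user_id": "u3", "region": "US", "joined_at": now - timedelta(days=1)},
]
print(build_audience(profiles, "Europe", 5))  # only u1 matches both filters
```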

&lt;h2&gt;
  
  
  What should you look for when evaluating a CDP solution?
&lt;/h2&gt;

&lt;p&gt;The best customer data platform for you depends on a number of factors your business may value: data storage, privacy, control, cost, completeness, extensibility, and more. Here are a few considerations that should inform your CDP evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How complete is the data coming in? With the new wave of privacy-conscious browser restrictions designed to improve user privacy and to weaken the marketplace of data brokers, first-party event collection systems are often affected even though they are not the intended target. Effective CDPs have a number of approaches to tackling external challenges and verifying data quality.&lt;/li&gt;
&lt;li&gt;  How good is the data coming in? Do you need tight controls over the quality of the data coming in from your application, your vendors, and your databases? Some CDPs support a concept of Tracking Plans to verify that incoming data has a consistent format that can be used to accurately update a user profile.&lt;/li&gt;
&lt;li&gt;  How much control do you have over your data? When working with a customer data platform, do you need to interact with the data directly, or are you ok working with the data through interfaces provided by the CDP? If a CDP shares its data with you or lives on a cloud data warehouse, then you have more visibility into your data and can re-use your customer data profiles for custom processing and analytics. Read more on &lt;a href="https://www.rudderstack.com/blog/what-is-data-control/"&gt;data control&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  What channels do you use to communicate with customers? B2B organizations have high-value investments in a smaller number of customers through sales processes, while B2C (such as ecommerce or marketplaces) organizations traditionally invest more in digital marketing and advertisements. Different CDPs have product optimizations or cost models that may be more advantageous for one or the other.&lt;/li&gt;
&lt;li&gt;  How comprehensive is your data privacy program? Data privacy laws such as GDPR and CCPA put stricter controls on the collection, access, storage, and facilitation of customer data. A CDP should be able to provide compliance for both data within its system as well as forward deletion requests to downstream systems. In addition to this, some CDPs may go one step further with compliance for medical data (HIPAA). Read more on &lt;a href="https://www.rudderstack.com/blog/customer-data-pipelines-play-a-key-role-in-data-privacy"&gt;the role your CDP can play in data privacy&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, the customer data platform is an evolving concept that brings together multiple elements of the modern data stack into a very clear set of goals that enable organizations to make the most out of first-party customer data.&lt;/p&gt;

</description>
      <category>cdp</category>
      <category>customerdata</category>
      <category>datawarehouse</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Announcing Delta Lake and Data Lake Destinations on RudderStack</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Tue, 08 Feb 2022 12:14:12 +0000</pubDate>
      <link>https://forem.com/rudderstack/announcing-delta-lake-and-data-lake-destinations-on-rudderstack-300i</link>
      <guid>https://forem.com/rudderstack/announcing-delta-lake-and-data-lake-destinations-on-rudderstack-300i</guid>
      <description>&lt;p&gt;We're excited to announce that Rudderstack now supports Delta Lake as a destination in addition to data lakes on Amazon S3, Azure, and Google Cloud Storage.&lt;/p&gt;

&lt;p&gt;Data lakes are storage areas for raw data that make the data readily available for people to use when they need it. The data in data lakes should be raw, easily accessible, and secure. From a customer data perspective, data lakes are a very powerful option because of their flexibility and ability to handle high volumes of information.&lt;/p&gt;

&lt;p&gt;Last week we wrote a &lt;a href="https://www.rudderstack.com/blog/how-does-the-data-lakehouse-enhance-the-customer-data-stack"&gt;post&lt;/a&gt; on how the data lakehouse enhances the customer data stack. As a lakehouse, &lt;a href="https://databricks.com/product/delta-lake-on-databricks"&gt;Databricks' Delta Lake&lt;/a&gt; is an architecture that builds on the data lake concept and enhances it with functionality commonly found in database systems. It lets users store structured, unstructured, and semi-structured data. Delta Lake was created in part to ensure that users never lose data during ETL and other event processing. It allows companies to scale and supports capabilities such as ACID transactions, scalable metadata management, schema enforcement, and more.&lt;/p&gt;

&lt;p&gt;Today, RudderStack customers are using our tool to seamlessly send event data. Visit the &lt;a href="https://www.rudderstack.com/docs/data-warehouse-integrations/"&gt;Warehouse Destinations&lt;/a&gt; section of our documentation to learn how to send event data from RudderStack to your Data Lake and Delta Lake destinations.&lt;/p&gt;

</description>
      <category>datalake</category>
      <category>deltalake</category>
      <category>cdp</category>
      <category>customerdata</category>
    </item>
    <item>
      <title>How Does The Data Lakehouse Enhance The Customer Data Stack?</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 31 Jan 2022 11:14:21 +0000</pubDate>
      <link>https://forem.com/rudderstack/how-does-the-data-lakehouse-enhance-the-customer-data-stack-3n6</link>
      <guid>https://forem.com/rudderstack/how-does-the-data-lakehouse-enhance-the-customer-data-stack-3n6</guid>
      <description>&lt;p&gt;Reading the title of this post will probably make you wonder how many buzzwords can fit in one sentence. A fair point, but it's worth exploring how a customer data stack benefits from the data lakehouse. First, let's clarify exactly what we mean by customer data stack and data lakehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a customer data stack?
&lt;/h2&gt;

&lt;p&gt;We talk a lot about &lt;a href="https://www.rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first"&gt;the modern data stack&lt;/a&gt;, but it's important to make a distinction here because customer data is special. It provides a unique value to the organization, and it comes with a unique set of technical challenges.&lt;/p&gt;

&lt;p&gt;Customer data provides unique value because it's the main source of behavioral information a business has about its customers. Anyone who has ever built a business will tell you that if you don't know your customer, you'll be out of business fast. This is especially true for modern online businesses where direct customer interactions are rare.&lt;/p&gt;

&lt;p&gt;But, like all valuable things, customer data comes with a cost. Remember those unique technical challenges I mentioned? Let's take a look at the characteristics of customer data that make it difficult to work with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Customer data comes in large quantities. Modern commerce, driven by online interactions, generates a massive amount of data every day. Just ask anyone who's worked at a large B2C company like Uber or even a medium-sized e-commerce company: the sheer volume of data generated by even simple interactions is huge.&lt;/li&gt;
&lt;li&gt;  Customer data is extremely noisy. The customer journey always involves many actions, but not every action holds value. The issue here is that there's no way to know what's valuable and what isn't before analysis. Your best bet is to record everything and let your brilliant data analysts and data scientists shine.&lt;/li&gt;
&lt;li&gt;  Customer data changes a lot. No one has the same behavior forever, right? I mean, only dead people have behavior that remains the same across time. This means you need to keep accumulating large amounts of data, but only some of it will be relevant at a specific point in time.&lt;/li&gt;
&lt;li&gt;  Customer data is a multidimensional time series. This sounds very scientific, but all it means is that time ordering is important, and there's no single value for each data point. This adds to the complexity of the data and how you interact with it. You can read about &lt;a href="https://www.rudderstack.com/blog/kafka-vs-postgresql-implementing-our-queueing-system-using-postgresql/"&gt;how we implemented our queueing system using PostgreSQL&lt;/a&gt; if you'd like to go down a rabbit hole with this one.&lt;/li&gt;
&lt;li&gt;  Finally, customer data can be pretty much anything, from very structured to completely unstructured. For example, an invoice issued at a specific time for a specific customer, that's customer data. So is an interaction that the same customer has on your website. Even the picture of that customer, taken as part of a verification process, is customer data.&lt;/li&gt;
&lt;/ul&gt;
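&lt;p&gt;The "multidimensional time series" point above is easier to see with a concrete sketch. Each event below is a hypothetical data point with one timestamp and several property dimensions; restoring time order per user is the kind of basic operation every customer data stack has to get right. (Field names here are illustrative, not any specific tracking schema.)&lt;/p&gt;

```javascript
// Hypothetical behavioral events: one timestamp, many dimensions per point.
const events = [
  { userId: "u1", event: "checkout",  timestamp: "2022-01-31T10:05:00Z", properties: { cart: 3, total: 42.5 } },
  { userId: "u1", event: "page_view", timestamp: "2022-01-31T10:01:00Z", properties: { path: "/pricing" } },
  { userId: "u2", event: "signup",    timestamp: "2022-01-31T09:59:00Z", properties: { plan: "free" } },
];

// Group events by user and restore time order -- the ordering, not any
// single value, is what carries the behavioral signal.
function journeys(evts) {
  const byUser = {};
  for (const e of evts) {
    (byUser[e.userId] = byUser[e.userId] || []).push(e);
  }
  for (const list of Object.values(byUser)) {
    // ISO-8601 strings in the same timezone sort correctly lexicographically.
    list.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  }
  return byUser;
}
```

&lt;p&gt;Real pipelines do this at warehouse scale rather than in application code, but the shape of the problem is the same.&lt;/p&gt;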

&lt;p&gt;I bring all of this up to convince you that working with customer data is important, and it's challenging enough to require some unique choices to be made when you &lt;a href="https://www.rudderstack.com/blog/the-data-stack-journey-lessons-from-architecting-stacks-at-heroku-and-mattermost"&gt;build your data stack&lt;/a&gt;. Thus, we have the term &lt;em&gt;Customer Data Stack&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;TL;DR, &lt;em&gt;A Customer Data Stack is a complete Data Stack (modern or not, it doesn't matter) that allows you to capture, store and process customer data at scale.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Data Lakehouse?
&lt;/h2&gt;

&lt;p&gt;If you've been paying attention lately, you've heard a lot of buzz around the terms &lt;em&gt;data lake&lt;/em&gt; and &lt;em&gt;data lakehouse&lt;/em&gt;. The data lakehouse is much younger than the data lake, which, to be honest, is pretty old! Humans started building data lakes with the inception of &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;HDFS&lt;/a&gt; and Hadoop. Yes, that old. A distributed file system and MapReduce are all it takes to build a data lake.&lt;/p&gt;

&lt;p&gt;Obviously, many things have changed since the early 2000s when it comes to data technology. The term data lake has now been formalized, and the lakehouse is the new kid on the block.&lt;/p&gt;

&lt;p&gt;But, before we get into any more details, let's make something clear. When we talk about data lakes and lakehouses, we mainly refer to &lt;em&gt;an architecture and not to a specific technology&lt;/em&gt;. It's important to keep this in mind because a lot of the confusion around these two terms comes from this misconception.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Lake
&lt;/h3&gt;

&lt;p&gt;Let's start with the definition of a data lake. Getting clear on this first will help us understand the data lakehouse.&lt;/p&gt;

&lt;p&gt;The main concept behind a data lake is the separation of storage and compute. Sounds familiar, right? Snowflake talks about this a lot, and so does Databricks. But, if you noticed, at the beginning of this section I mentioned HDFS and Hadoop. The first data lakes, using HDFS as a distributed file system and MapReduce as a processing framework, separated storage (HDFS) from processing (Hadoop MapReduce). This is the fundamental concept of a data lake.&lt;/p&gt;

&lt;p&gt;Today, when we talk about a data lake, the first thing we think about is S3, which replaces HDFS as the storage layer and brings storage to the cloud. Of course, instead of S3 we can use the equivalent products from GCP and Azure, but the idea remains the same: an extremely scalable object storage system that sits behind an API and lives in the cloud.&lt;/p&gt;

&lt;p&gt;Processing has also evolved since Hadoop. First came &lt;a href="https://spark.apache.org/"&gt;Spark&lt;/a&gt;, which offered a more user-friendly API for MapReduce-style processing, and then we got distributed query engines like &lt;a href="https://trino.io/"&gt;Trino&lt;/a&gt;. These two processing frameworks co-exist most of the time, addressing different needs. Trino is mainly used for online analytical queries where latency is important, while Spark is heavily used for bigger workloads (think ETL) where the volume of data is much larger and latency matters less.&lt;/p&gt;

&lt;p&gt;So, having S3, using something like &lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; for storing the data, and using Trino or Spark for processing the data gives us a lean but capable &lt;em&gt;Data Lake&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This architecture is good on paper and scales amazingly well, but a number of functionalities commonly found in data warehouses and transactional databases are missing. For example, we haven't mentioned anything about transactions. This is why the &lt;em&gt;Lakehouse&lt;/em&gt; exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data lakehouse
&lt;/h3&gt;

&lt;p&gt;A &lt;em&gt;Lakehouse&lt;/em&gt; is an architecture that builds on top of the data lake concept and enhances it with functionality commonly found in database systems. The limitations of the data lake led to the emergence of a number of technologies including &lt;a href="https://iceberg.apache.org/"&gt;Apache Iceberg&lt;/a&gt; and &lt;a href="https://hudi.apache.org/"&gt;Apache Hudi&lt;/a&gt;. These technologies define a &lt;em&gt;Table Format&lt;/em&gt; on top of storage formats like ORC and Parquet on which additional functionality like transactions can be built.&lt;/p&gt;

&lt;p&gt;So, what is a data lakehouse? It's an &lt;em&gt;architecture&lt;/em&gt; that combines the critical data management features of a data warehouse with the openness and scalability of the data lake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Lakehouse as the foundation of a Customer Data Stack
&lt;/h2&gt;

&lt;p&gt;Now that we've covered the definitions and utility of the customer data stack and data lakehouse, I'll make a case for leveraging the data lakehouse in the customer data stack.&lt;/p&gt;

&lt;p&gt;When building a data stack, one of the most important and impactful decisions you'll make is with the storage and processing layer. In most cases, this is a data warehouse like Redshift or Snowflake. Here are the main benefits of using a lakehouse when dealing with customer data instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Cheap, scalable storage. Because of the characteristics of customer data we covered above, to efficiently work with it, you'll need a technology that can store it at scale and keep costs as low as possible. The only architecture that offers this efficiently is a Lakehouse (or data lake), which allows you to build on top of the cheap infrastructure offered by a cloud object storage service. None of the cloud warehouse solutions can offer that, especially if you don't want to maintain any pruning policy for older data.&lt;/li&gt;
&lt;li&gt;  Support for every format. The nature of customer data means you might have to store and process completely heterogeneous data, ranging from structured tabular data to binary data. Cloud data warehouses, though they've made progress towards supporting structured and semi-structured data, still have a hard time supporting every possible data format. Data lakes and lakehouses don't suffer from this limitation. They offer a future-proof option capable of supporting every format we currently use.&lt;/li&gt;
&lt;li&gt;  Hybrid workloads. Another big consideration is that customer data usually results in hybrid workloads. What I mean by this is that you will want to do operational analytics on your customer data, and soon you'll need to incorporate more sophisticated ML techniques to do things like &lt;a href="https://www.rudderstack.com/blog/churn-prediction-with-bigqueryml/"&gt;churn prediction&lt;/a&gt; and attribution (if you aren't already). Data lakes and lakehouses are the best fit for covering both workloads, and they do it pretty natively. Cloud data warehouses now offer ML functionality, but it's limited compared to what you can do on a data lake or lakehouse. So, a lakehouse is pretty much the de facto storage-processing solution when DS and ML projects are involved. A single solution that serves all the use cases relieves companies of having to implement both a CDW and a data lake, saving cost and operational complexity.&lt;/li&gt;
&lt;li&gt;  Lakehouses are open. Unlike the proprietary, closed cloud data warehouses on the market right now, lakehouses are open. That means there's an ever-evolving ecosystem of technologies and products that can supplement the data lake/lakehouse, allowing the operating team to be extremely agile when it comes to maintaining and extending the customer data stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you are still with us, you hopefully have a bit of a better understanding of why we need and use all these buzzwords and, most importantly, how they fit together.&lt;/p&gt;

&lt;p&gt;Data lakes and lakehouses are quickly becoming fully-featured data warehouses with infinite scale and low cost. Customer data is a natural fit for such a storage and processing layer. These architectures will allow you to support any possible use case you might have in the future, even if all you care about today is operational analytics.&lt;/p&gt;

&lt;p&gt;Here at RudderStack, we are strong supporters of the data lake and lakehouse architectures. We have invested our resources to build best-in-class &lt;a href="https://www.rudderstack.com/docs/data-warehouse-integrations/"&gt;integrations&lt;/a&gt; with both data lakes and lakehouses like Delta Lake. By using RudderStack together with a Lakehouse or data lake, you can have a complete customer data stack that scales amazingly well and allows you to do anything you want with your data.&lt;/p&gt;

</description>
      <category>datalake</category>
      <category>cdp</category>
      <category>customerdata</category>
      <category>dataprivacy</category>
    </item>
    <item>
      <title>RudderStack and Braze Power Advanced Customer Engagement</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Mon, 24 Jan 2022 06:44:22 +0000</pubDate>
      <link>https://forem.com/rudderstack/rudderstack-and-braze-power-advanced-customer-engagement-4g4o</link>
      <guid>https://forem.com/rudderstack/rudderstack-and-braze-power-advanced-customer-engagement-4g4o</guid>
      <description>&lt;p&gt;Creating and delivering deeply personalized customer engagement campaigns just got easier. Expanding on our existing Braze destination integration, RudderStack now features a Braze Currents source integration. The new integration allows Braze users to create a continuous data feedback loop with the customer engagement platform, unlocking even more critical context to fuel stronger campaigns and enrich other tools in the stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leading customer engagement tooling
&lt;/h2&gt;

&lt;p&gt;With Braze, brands have the keys to customer engagement at their fingertips. The Braze customer engagement platform combines the context needed for personalization with the tooling necessary to execute and optimize sophisticated campaigns across multiple channels. Braze even offers AI-enabled insights, so brands can calibrate campaigns in-flight. With its data-driven approach, Braze underpins the positive engagement experience that serves as a pillar of growth in the digital age.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context (data) is king
&lt;/h2&gt;

&lt;p&gt;Deeply personalized engagement campaigns require an immense amount of contextualizing data. Without the proper context, marketers can send irrelevant, annoying, and, in some cases, offensive messages. When it comes to digital customer engagement, strong relationships require accurate, robust customer profiles built on behavioral data. Braze is an incredibly powerful tool, but without data, it's like a racecar with no gas. Enter RudderStack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliable, complete data delivery
&lt;/h2&gt;

&lt;p&gt;RudderStack's &lt;a href="https://rudderstack.com/product/event-stream/"&gt;Event Stream&lt;/a&gt; makes it easy to get behavioral data from web and mobile sources into Braze reliably and in real time. This means brands can create holistic customer and behavioral profiles within Braze, then leverage the platform's advanced tooling to quickly build deeply personalized campaigns based on full and accurate information. All of this data in Braze is valuable and actionable. With our expanded partnership, Braze is no longer a cul-de-sac for its value: users can stream engagement events directly from Braze, through RudderStack, and send them to their entire customer data stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next-level engagement with a continuous data loop
&lt;/h2&gt;

&lt;p&gt;With the Braze Currents integration, Braze becomes both an event destination and an event streaming source in RudderStack. Practically, this means all of the valuable engagement data in Braze can be streamed from Braze into other destinations, including the data warehouse, where it can be enriched with even more valuable context, and looped back into Braze. With additional context on top of already robust customer profiles, Braze users can unlock the next level of personalization. This deeper integration raises the bar for delighting customers with great digital experiences. &lt;em&gt;The Braze Currents integration is currently in Beta. If you would like to participate, please &lt;a href="mailto:Myles@rudderstack.com?subject=Hi%20Myles,%20I%27m%20interested%20in%20the%20Braze%20Currents%20Integration"&gt;contact us&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more about data activation with RudderStack &amp;amp; Braze
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/video-library/data-activation-with-braze-snowflake-and-rudderstack"&gt;Register here&lt;/a&gt;, and join us for a live webinar on February 10th at 12:00 pm PT featuring leaders from RudderStack and Braze. You'll learn how you can leverage a continuous, bi-directional data loop between Braze and your data warehouse to fuel stronger customer experiences.&lt;/p&gt;

</description>
      <category>braze</category>
      <category>rudderstack</category>
      <category>customerdata</category>
      <category>cdp</category>
    </item>
    <item>
      <title>RudderStack Product News Vol. #019 - Destination UI</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Fri, 21 Jan 2022 07:57:48 +0000</pubDate>
      <link>https://forem.com/rudderstack/rudderstack-product-news-vol-019-destination-ui-2c5g</link>
      <guid>https://forem.com/rudderstack/rudderstack-product-news-vol-019-destination-ui-2c5g</guid>
      <description>&lt;h2&gt;
  
  
  Destination UI
&lt;/h2&gt;

&lt;p&gt;The destination UI now shows event volume on a per-destination basis, and it lets users see errors from the last 30 days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S73Pf4Eu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/bc50b3b66e4982dc1a10e5467a22fbb74fcf871b-1600x795.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S73Pf4Eu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/bc50b3b66e4982dc1a10e5467a22fbb74fcf871b-1600x795.jpg" alt="Destination UI" width="880" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gVW-YzKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/5f7b8ffc9e34973a048279b5b9a75d65c65a7d42-1600x808.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gVW-YzKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.sanity.io/images/97bpcflt/production/5f7b8ffc9e34973a048279b5b9a75d65c65a7d42-1600x808.jpg" alt="Destination UI" width="880" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.rudderstack.com/"&gt;Try Now&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse ETL Visual Data Mapper (VDM) for Klaviyo
&lt;/h2&gt;

&lt;p&gt;We released our new VDM for Klaviyo. E-commerce teams use Klaviyo as a platform that supports unique features such as category-based segmentation, event triggers based on page views, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/docs/destinations/marketing/klaviyo/"&gt;Read More&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intercom
&lt;/h2&gt;

&lt;p&gt;We now support Intercom as a reverse-ETL VDM destination. Intercom is a real-time business messaging platform that lets teams bring their customer lifecycle activities into one place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/docs/destinations/business-messaging/intercom/"&gt;Learn More&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chargebee
&lt;/h2&gt;

&lt;p&gt;Chargebee is a subscription management system that helps teams handle all aspects of the subscription lifecycle including invoicing, recurring billing, and trial management. RudderStack now supports Chargebee as an extract source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/guides/how-to-load-data-from-chargebee-to-postgresql/"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  watchOS
&lt;/h2&gt;

&lt;p&gt;We're excited to announce our SDK for watchOS, the operating system developed for Apple Watch that is based on the iOS operating system. You can leverage the watchOS SDK to develop applications with interactive user experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-ios-sdk/watchOS/#watchos"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  tvOS
&lt;/h2&gt;

&lt;p&gt;tvOS is Apple's operating system for Apple TVs that are second generation or later. With the RudderStack SDK for tvOS, you can develop applications that couple excellent picture quality and sound with interactive user experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-ios-sdk/tvOS/#tvos"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chromecast
&lt;/h2&gt;

&lt;p&gt;Google Chromecast is a device that plugs into your TV or monitor with an HDMI port. It allows you to stream content from your phone or computer. RudderStack supports integrating the JavaScript SDK with the Cast App.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-javascript-sdk/#chromecast"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BigQuery Stream
&lt;/h3&gt;

&lt;p&gt;Google's BigQuery lets you stream event data by leveraging its streaming API. RudderStack supports this destination for real-time streaming.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/destinations/streaming-platforms/bigquery-stream/"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Re-marketing for Google Ads
&lt;/h3&gt;

&lt;p&gt;We have released the dynamic re-marketing feature for Google Ads web device mode destination.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/destinations/advertising/google-ads/#track"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Optimize
&lt;/h3&gt;

&lt;p&gt;RudderStack supports Google Optimize, Google's free web optimization tool, as a destination to which you can send your website data for efficient A/B testing.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/destinations/testing-and-personalization/google-optimize/#google-optimize"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Post Affiliate Pro
&lt;/h3&gt;

&lt;p&gt;You can now send your RudderStack event data to Post Affiliate Pro, a popular affiliate marketing software that lets you manage, track, and boost your lead generation efforts for affiliate programs.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/destinations/marketing/post-affiliate-pro"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  AppsFlyer
&lt;/h3&gt;

&lt;p&gt;AppsFlyer is a SaaS platform that tracks mobile attribution and marketing analytics. RudderStack now supports ingesting event data from AppsFlyer into our platform.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/stream-sources/appsflyer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  App Center
&lt;/h3&gt;

&lt;p&gt;Microsoft's App Center lets teams manage their app's lifecycle seamlessly. Read how teams can ingest their event data from App Center to RudderStack.&lt;/p&gt;

&lt;p&gt;Learn more &lt;a href="https://rudderstack.com/docs/stream-sources/appcenter/#app-center"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other happenings at RudderStack
&lt;/h2&gt;

&lt;p&gt;Live Webinar January 27: New Year, Better Event Data with Avo &amp;amp; RudderStack&lt;/p&gt;

&lt;p&gt;Join us live on January 27 to learn how RudderStack + Avo combine to increase your event data quality and streamline your behavioral data pipelines. January 27 @ 9am PT / 12pm ET / 5pm GMT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.avo.app/event-driven-infrastructure-webinar?utm_source=rudder-newsletter&amp;amp;utm_medium=email&amp;amp;utm_campaign=avo-rudderstack-webinar-220127&amp;amp;utm_content=rudderstack"&gt;Register now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Data Stack Show: What is the Modern Data Stack?&lt;/p&gt;

&lt;p&gt;If you missed the live stream last month, you're in luck. The panel discussion is now available as a regular episode. Tune in to get insights on the modern data stack from leaders at Databricks, dbt, Fivetran, Hinge, and Essence VC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://datastackshow.com/podcast/what-is-the-modern-data-stack/?utm_source=customer_io&amp;amp;utm_medium=email&amp;amp;utm_campaign=CMPGN_59_OE&amp;amp;utm_content=None&amp;amp;utm_term=%7Bkeyword%7D&amp;amp;raid=a16bc5700a59d1181476a8a7e73326bb"&gt;Listen&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the Blog: 5 Million Users a Day From Snowflake to Iterable&lt;/p&gt;

&lt;p&gt;We love helping our customers solve problems, and we love showing off our product. In this piece, Customer Success Engineering Lead David Daly details how RudderStack's flexibility helped a customer overcome a few challenges related to bulk subscription enrollment. &lt;a href="https://rudderstack.com/blog/5-million-users-day-from-snowflake-to-iterable/?utm_source=customer_io&amp;amp;utm_medium=email&amp;amp;utm_campaign=CMPGN_59_OE&amp;amp;utm_content=None&amp;amp;utm_term=%7Bkeyword%7D&amp;amp;raid=a16bc5700a59d1181476a8a7e73326bb"&gt;Read more&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Predictions: The State of Data Engineering in 2022&lt;/p&gt;

&lt;p&gt;After a historic year, we revisit our 2021 predictions and make a few for 2022. Check out this post to get our insight on what's sure to be another exciting year in data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rudderstack.com/blog/the-state-of-data-engineering-in-2022/?utm_source=customer_io&amp;amp;utm_medium=email&amp;amp;utm_campaign=CMPGN_59_OE&amp;amp;utm_content=None&amp;amp;utm_term=%7Bkeyword%7D&amp;amp;raid=a16bc5700a59d1181476a8a7e73326bb"&gt;Read more&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rudderstack</category>
      <category>etl</category>
      <category>datawarehouse</category>
      <category>customerdata</category>
    </item>
    <item>
      <title>Refactoring RudderStack's High-performance JavaScript SDK</title>
      <dc:creator>Team RudderStack</dc:creator>
      <pubDate>Fri, 21 Jan 2022 07:55:51 +0000</pubDate>
      <link>https://forem.com/rudderstack/refactoring-rudderstacks-high-performance-javascript-sdk-41lk</link>
      <guid>https://forem.com/rudderstack/refactoring-rudderstacks-high-performance-javascript-sdk-41lk</guid>
      <description>&lt;p&gt;Since its initial release, we've refactored our &lt;a href="https://rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-javascript-sdk/"&gt;JavaScript SDK&lt;/a&gt; multiple times, and we've written about how previous improvements &lt;a href="https://rudderstack.com/blog/javascript-sdk-from-200ms-to-20ms"&gt;reduced execution time from 200ms to 20ms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since then, the JavaScript SDK has grown in size as we've added support for new &lt;a href="https://rudderstack.com/docs/connections/rudderstack-connection-modes/#device-mode"&gt;device-mode&lt;/a&gt; integrations. It became bulky enough to start impacting load times, so we recently &lt;a href="https://rudderstack.com/docs/connections/rudderstack-connection-modes/#device-mode"&gt;introduced a new, optimized version&lt;/a&gt; of the SDK.&lt;/p&gt;

&lt;p&gt;Here, I'll detail the improvements made with this refactoring, walk through our team's decision-making process, outline the tradeoffs we considered, and showcase the results of our work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key improvements
&lt;/h2&gt;

&lt;p&gt;To optimize the size of the SDK and improve its performance, we focused on three key items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Freeing the SDK of all integrations code upon build&lt;/li&gt;
&lt;li&gt;  Clearing technical debt&lt;/li&gt;
&lt;li&gt;  Replacing third-party package dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Freeing the SDK of integrations code upon build
&lt;/h3&gt;

&lt;p&gt;Instead of statically importing device-mode integration modules into the core module, the integration modules are now built into independent plugins (scripts) that can be readily loaded on the client side. Once the &lt;code&gt;load&lt;/code&gt; API of the SDK is called, the necessary destination integrations are identified from the source configuration (pulled from the &lt;a href="https://rudderstack.com/docs/get-started/rudderstack-architecture/#control-plane"&gt;control plane&lt;/a&gt;), and their plugins are asynchronously loaded one after another from the hosted location*. After a timeout, the successfully loaded integration modules are initialized to proceed with event forwarding.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;The hosted location defaults to RudderStack's CDN. For a custom hosted location, it can be overridden via the &lt;/em&gt;&lt;code&gt;destSDKBaseURL&lt;/code&gt;&lt;em&gt; option in the &lt;/em&gt;&lt;code&gt;load&lt;/code&gt;&lt;em&gt; call. Additionally, the SDK can determine this URL from the script tag that adds the SDK to the website (provided the file name is still &lt;/em&gt;&lt;code&gt;rudder-analytics.min.js&lt;/code&gt;&lt;em&gt;).&lt;/em&gt;&lt;/p&gt;
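&lt;p&gt;The loading flow described above can be sketched roughly as follows. This is an illustrative model, not the SDK's actual internals: &lt;code&gt;fetchPlugin&lt;/code&gt;, &lt;code&gt;loadIntegrations&lt;/code&gt;, and the stub registry are all hypothetical names, and in the browser the fetch step would inject a script tag pointing at the CDN rather than resolve from a local object.&lt;/p&gt;

```javascript
// Illustrative sketch: load configured plugins one after another,
// then initialize only the ones that loaded before the timeout.
function fetchPlugin(name, registry) {
  // Stand-in for injecting a <script> tag from the hosted location.
  return new Promise((resolve, reject) =>
    setTimeout(
      () => (registry[name] ? resolve(registry[name]) : reject(new Error(name))),
      5
    )
  );
}

async function loadIntegrations(configured, registry, timeoutMs) {
  const loaded = [];
  const worker = (async () => {
    for (const name of configured) {
      try {
        // Sequential loading; a plugin that fails to load is skipped.
        loaded.push(await fetchPlugin(name, registry));
      } catch (_) {}
    }
  })();
  // After the timeout, proceed with whatever loaded successfully.
  await Promise.race([worker, new Promise((r) => setTimeout(r, timeoutMs))]);
  return loaded.map((plugin) => plugin.init());
}
```

&lt;p&gt;Keeping each integration behind an async boundary like this is what makes it possible to ship the core SDK without any integration code baked in.&lt;/p&gt;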

&lt;h3&gt;
  
  
  Clearing technical debt
&lt;/h3&gt;

&lt;p&gt;We removed as much bloat from the SDK as possible. This included dead, redundant, and deprecated code along with deprecated auto-track functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replacing third-party package dependencies
&lt;/h3&gt;

&lt;p&gt;Wherever possible, we replaced third-party package dependencies with lighter ones. A few cases required custom implementations in order to achieve the results we were looking for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why did we decide on this approach?
&lt;/h2&gt;

&lt;p&gt;By design, all the device-mode integrations are independent of each other, so it didn't make sense to bind everything together as a single piece. Moreover, because each customer will only connect a subset of device-mode integrations to their JS/web source, loading only the necessary integrations on their site is the ideal scenario. These improvements also involved minimal changes to our SDK and processes when compared to other alternatives.&lt;/p&gt;

&lt;p&gt;One alternative we considered was to dynamically build the SDK with necessary integrations when the request is made to &lt;code&gt;https://cdn.rudderlabs.com/v1.1/rudder-analytics.js/&amp;lt;write key&amp;gt;&lt;/code&gt;. Using this approach, the device-mode integrations are packaged with the core SDK and delivered based on the write key provided in the URL.&lt;/p&gt;

&lt;p&gt;We saw a few disadvantages to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  CDN costs would increase because we would have to cache a different version of the SDK for every write key&lt;/li&gt;
&lt;li&gt;  We wouldn't be able to take advantage of browser caching across various websites the user visits&lt;/li&gt;
&lt;li&gt;  Migrating existing users would be challenging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  What tradeoffs did we have to make?
&lt;/h2&gt;

&lt;p&gt;Fortunately, this refactoring didn't involve any major tradeoffs, but there are two worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  CDN costs: Hosting all of the individual device-mode integration SDKs means increased CDN costs. Luckily, the additional cost is not a significant burden.&lt;/li&gt;
&lt;li&gt;  Migration costs: To make migrating to v1.1 worthwhile for our customers, we knew we needed to (1) introduce significant performance improvements over v1, and (2) make migrating as easy as possible. We were able to introduce significant improvements, which I'll highlight below, and we worked to make migration as painless as possible. In most cases, migration is complete in a few simple steps, which we documented in a &lt;a href="https://rudderstack.com/docs/stream-sources/rudderstack-sdk-integration-guides/rudderstack-javascript-sdk/version-migration-guide/"&gt;migration guide&lt;/a&gt; to help customers with all their deployment scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Problems we had to solve
&lt;/h2&gt;

&lt;p&gt;In v1, all the integrations were exported from their modules as default exports. We had to convert them all to named exports so they could be loaded dynamically. See the examples below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default type&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;import&lt;/span&gt;  &lt;span class="nx"&gt;Amplitude&lt;/span&gt;  &lt;span class="k"&gt;from&lt;/span&gt;  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./browser&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;  &lt;span class="k"&gt;default&lt;/span&gt;  &lt;span class="nx"&gt;Amplitude&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Named export&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;import&lt;/span&gt;  &lt;span class="nx"&gt;Amplitude&lt;/span&gt;  &lt;span class="k"&gt;from&lt;/span&gt;  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./browser&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="nx"&gt;Amplitude&lt;/span&gt;  &lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
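
&lt;p&gt;The reason named exports matter here is that a dynamic &lt;code&gt;import()&lt;/code&gt; resolves to a module namespace object keyed by export name, so the core SDK can look up the integration by its configured name. A sketch of the idea (the &lt;code&gt;importers&lt;/code&gt; map and function name are hypothetical):&lt;/p&gt;

```javascript
// Illustrative sketch: pick the right integration off the module
// namespace that a dynamic import() resolves to. The `importers`
// map stands in for calls like () => import("./Amplitude").
async function loadIntegration(name, importers) {
  const mod = await importers[name]();
  return mod[name]; // works because the export is named, not default
}
```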



&lt;p&gt;Additionally, we had to write a script to build all the individual integrations in one go. This is what allows us to deploy the integrations along with the core SDK.&lt;/p&gt;
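
&lt;p&gt;At its core, such a script generates one bundle entry per integration. A rough illustration (the paths, integration list, and output layout here are hypothetical, not our actual build config):&lt;/p&gt;

```javascript
// Illustrative sketch: emit one bundler entry per device-mode
// integration so all of them build in a single pass alongside the
// core SDK. Paths and the integration list are assumed.
function buildConfigs(names) {
  return names.map((name) => ({
    input: `integrations/${name}/index.js`,
    output: { file: `dist/js-integrations/${name}.min.js`, format: "iife", name },
  }));
}
```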

&lt;h2&gt;
  Results of the refactoring
&lt;/h2&gt;

&lt;p&gt;Our new SDK is lighter and faster than the previous version. To put it into numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We reduced the SDK size by 70% (114 KB to 34 KB)&lt;/li&gt;
&lt;li&gt;  SDK download times are 80% faster (9.44 ms to 1.96 ms)&lt;/li&gt;
&lt;li&gt;  Script evaluation times are 28% faster (86 ms to 63 ms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/rudderlabs/rudder-sdk-js/pull/321"&gt;PR for the refactoring&lt;/a&gt; on GitHub.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>cdp</category>
      <category>customerdata</category>
      <category>rudderstack</category>
    </item>
  </channel>
</rss>
