<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rich Dudley</title>
    <description>The latest articles on Forem by Rich Dudley (@rjdudley).</description>
    <link>https://forem.com/rjdudley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F515853%2F97193a61-12e3-4e67-8fef-a78ed937f84c.jpeg</url>
      <title>Forem: Rich Dudley</title>
      <link>https://forem.com/rjdudley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rjdudley"/>
    <language>en</language>
    <item>
      <title>EventBridge: The Enterprise Service Bus You Probably Didn't Realize Was An ESB</title>
      <dc:creator>Rich Dudley</dc:creator>
      <pubDate>Thu, 16 Sep 2021 02:40:02 +0000</pubDate>
      <link>https://forem.com/rjdudley/eventbridge-the-enterprise-service-bus-you-probably-didn-t-realize-was-an-esb-5ffa</link>
      <guid>https://forem.com/rjdudley/eventbridge-the-enterprise-service-bus-you-probably-didn-t-realize-was-an-esb-5ffa</guid>
      <description>&lt;p&gt;AWS takes a lot of heat for its product names, usually because they're weird, and that's especially true for EventBridge, but for a different reason.  EventBridge is often marketed as the successor to CloudWatch Events, but that description greatly undersells the service.  EventBridge is more than just a relay for things which happened; it is by any definition a fully-fledged enterprise service bus, and is worthy of consideration alongside Kafka, MuleSoft and other event-based messaging platforms.&lt;/p&gt;

&lt;p&gt;In Service-Oriented Architecture, an Enterprise Service Bus is actually a compound pattern realizing seven other subpatterns.  Someone somewhere knew what they were doing when they designed EventBridge, because every ESB subpattern can be realized with an EventBridge feature.  Here's how the features line up with the patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SOA Subpattern&lt;/th&gt;
&lt;th&gt;EventBridge Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Asynchronous Queuing&lt;/td&gt;
&lt;td&gt;EventBridge can use SQS as a destination to queue events before sending to the consumers, as shown in this pattern: &lt;a href="https://serverlessland.com/patterns/eventbridge-sqs"&gt;https://serverlessland.com/patterns/eventbridge-sqs&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven Messaging&lt;/td&gt;
&lt;td&gt;As a consumer, EventBridge has direct integration with over 130 AWS sources, plus a number of third-party SaaS platforms.  Direct integration with API Gateway, as well as a direct PutEvents API, allows for integration from custom sources.  EventBridge supports 35 targets, including API Gateway destinations for custom integrations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate Routing&lt;/td&gt;
&lt;td&gt;A rules engine filters events and routes messages to one or more targets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policy Centralization&lt;/td&gt;
&lt;td&gt;EventBridge has a schema registry, supporting either OpenAPI or JSON-Schema.  Code bindings for several languages can be downloaded from the schema registry.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliable Messaging&lt;/td&gt;
&lt;td&gt;EventBridge offers at-least once delivery, with retry and durable storage across multiple availability zones.  EventBridge also features archive and replay of events.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules Centralization&lt;/td&gt;
&lt;td&gt;Routing and transformation rules are localized to a single service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Broker&lt;/td&gt;
&lt;td&gt;The Service Broker pattern is itself a compound pattern, composed of the Data Format Transformation, Data Model Transformation and Protocol Bridging patterns.  EventBridge transformation rules realize Data Format Transformation and Data Model Transformation, while Protocol Bridging occurs mainly with API Gateway integrations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
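
&lt;p&gt;To make Intermediate Routing concrete, a rule's event pattern is just JSON.  Here's a rough sketch (the source, detail-type, rule name and queue ARN are all hypothetical, and the calls that touch AWS are commented out since they need boto3 and credentials):&lt;/p&gt;

```python
import json

# Event pattern for a rule: match "OrderPlaced" events from a hypothetical
# custom source, filtered to totals over 100 via content filtering.
pattern = {
    "source": ["com.example.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {"total": [{"numeric": [">", 100]}]},
}

# Creating the rule and pointing it at an SQS queue target:
# import boto3
# events = boto3.client("events")
# events.put_rule(Name="large-orders", EventPattern=json.dumps(pattern))
# events.put_targets(Rule="large-orders",
#                    Targets=[{"Id": "queue", "Arn": queue_arn}])
```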

&lt;p&gt;It's hard to overstate the value of API endpoints.  These realize Service Loose Coupling, which makes EventBridge completely agnostic to whatever language or application is being used by producers and consumers.  There are no proprietary protocols, libraries or config languages beyond HTTP.  This allows easy integration across applications, business units or even cloud providers.&lt;/p&gt;
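
&lt;p&gt;As a minimal sketch of what a custom producer looks like (the source, detail-type and payload here are made up), publishing to the bus is a single PutEvents call:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

def build_entry(source, detail_type, detail, bus_name="default"):
    """Shape one entry for the EventBridge PutEvents API."""
    return {
        "Time": datetime.now(timezone.utc),
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }

entry = build_entry("com.example.orders", "OrderPlaced",
                    {"orderId": "o-123", "total": 42.5})

# Publishing requires boto3 and AWS credentials:
# import boto3
# boto3.client("events").put_events(Entries=[entry])
```

&lt;p&gt;Any language that can sign an HTTPS request can make the same call, which is exactly what keeps producers loosely coupled.&lt;/p&gt;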

&lt;p&gt;Many purpose-built ESBs require a great deal of infrastructure and team members to support them, but EventBridge is serverless.  There is no infrastructure to manage, and no capacity to plan around other than the limits documented at &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;EventBridge itself can be configured in a matter of minutes, and with its direct integrations to so many AWS services events can be flowing shortly thereafter.  Accomplishing so much in a single iteration is a tremendous boost to systems integration.&lt;/p&gt;

&lt;p&gt;In addition to its documentation, EventBridge is also supported by a number of patterns on Serverless Land, at &lt;a href="https://serverlessland.com/patterns?services=eventbridge"&gt;https://serverlessland.com/patterns?services=eventbridge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With its speed to implement and ESB-fulfilling features, EventBridge deserves consideration alongside other messaging systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Enterprise_service_bus"&gt;https://en.wikipedia.org/wiki/Enterprise_service_bus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://patterns.arcitura.com/soa-patterns/compound_patterns/enterprise_service_bus"&gt;https://patterns.arcitura.com/soa-patterns/compound_patterns/enterprise_service_bus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/eventbridge/features/"&gt;https://aws.amazon.com/eventbridge/features/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>15 S3 Facts for S3’s 15th</title>
      <dc:creator>Rich Dudley</dc:creator>
      <pubDate>Mon, 15 Mar 2021 01:28:28 +0000</pubDate>
      <link>https://forem.com/rjdudley/15-s3-facts-for-s3-s-15th-65j</link>
      <guid>https://forem.com/rjdudley/15-s3-facts-for-s3-s-15th-65j</guid>
      <description>&lt;p&gt;To celebrate S3’s 15th birthday on 3/14/2021, and to kick off AWS Pi Week, I tweeted out 15 facts about S3.  Here they are as a blog post, to make them easier to read.  Because of the rapid pace of innovation in AWS services, including S3, if you’re reading this in the future, some things may have changed.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;S3 is designed for "eleven 9s" of durability. When you take into account redundancy in and across availability zones, if you stored 10,000,000 objects you'd expect to lose only a single object once every 10,000 years. Read more at &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-reduced-redundancy-storage-rrs/"&gt;https://aws.amazon.com/blogs/aws/new-amazon-s3-reduced-redundancy-storage-rrs/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 is region-bound, which means all S3 buckets in that region are partying in the same publicly available cloud ether. You can restrict access to a VPC but the bucket is still located outside the VPC. Related: &lt;a href="https://cloudonaut.io/does-your-vpc-endpoint-allow-access-to-half-of-the-internet/"&gt;https://cloudonaut.io/does-your-vpc-endpoint-allow-access-to-half-of-the-internet/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 is a very versatile storage service. The trillions of objects it stores are the basis for many workloads, including serving websites, video streaming and analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The return of INI files! With a first byte latency of milliseconds, S3 is suitable for storing configuration settings in an available and inexpensive way. Databases are no longer a fixed cost and there is no need for one just for configuration. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 is designed for "infinite storage". Each object can be up to 5TB in size, and there is no limit to the number of objects you can store in a bucket. Analytics aren't constrained by a file or disk size. It's like a TARDIS, or bag of holding!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you perform operations on hundreds, thousands or more objects? S3 Batch Operations allow you to copy objects, restore from Glacier, or even invoke a Lambda function for each object. For more information, see &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/"&gt;https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 is a "consumption model", so you pay only for what you use when you use it. No more provisioning fixed-size network storage solutions with large up-front costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But what if you need massive object storage closer to your location? S3 on Outposts puts S3 on-premises, right where you collect or process your data. For more info, start at &lt;a href="https://aws.amazon.com/s3/outposts/"&gt;https://aws.amazon.com/s3/outposts/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your bandwidth is limited or non-existent, you can use Snowball Data Transfer to move TB to PB of data in and out of AWS. Learn more at &lt;a href="https://aws.amazon.com/snowball/"&gt;https://aws.amazon.com/snowball/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For data collection and object generation at the most extreme edges there is Snowball Edge Storage. Snowball Edge can even run processing workloads. Read more at &lt;a href="https://docs.aws.amazon.com/snowball/latest/developer-guide/whatisedge.html"&gt;https://docs.aws.amazon.com/snowball/latest/developer-guide/whatisedge.html&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Although you can upload files to S3 via the console, CLI and REST API, wouldn't it be great if you could just drag a file to a network share and have it appear in the cloud? With a File Gateway, you can do exactly that! See &lt;a href="https://aws.amazon.com/storagegateway/file/"&gt;https://aws.amazon.com/storagegateway/file/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 offers multiple storage classes, so you can optimize cost, latency and retention period. Standard offers the lowest latency but at the highest cost, while Glacier Deep Archive is perfect for yearslong retention. Read more at &lt;a href="https://aws.amazon.com/s3/storage-classes/"&gt;https://aws.amazon.com/s3/storage-classes/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 Storage Lens is a central dashboard organizations can use for insight into S3 utilization and to get recommendations to optimize price. Read more at &lt;a href="https://aws.amazon.com/blogs/aws/s3-storage-lens/"&gt;https://aws.amazon.com/blogs/aws/s3-storage-lens/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 can version objects, so if you accidentally delete or improperly update an object, you can recover the most recent save or many prior versions, too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 is a very secure service. IAM policies can be applied at the bucket and object level with a great deal of granularity. Additionally, VPC endpoints bind S3 traffic to a specific VPC only.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
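
&lt;p&gt;Fact 14, versioning, is easy to sketch in code.  The bucket and key below are hypothetical, and the boto3 calls are commented out because they need credentials and a real versioned bucket; the helper simply picks the version before the latest out of a list_object_versions response:&lt;/p&gt;

```python
def prior_version(list_response):
    """Given a list_object_versions-style response, return the version
    entry just before the latest (entries come back newest-first)."""
    history = list_response.get("Versions", [])
    return history[1] if len(history) > 1 else None

# Against a real bucket:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(Bucket="my-example-bucket",
#                          VersioningConfiguration={"Status": "Enabled"})
# resp = s3.list_object_versions(Bucket="my-example-bucket",
#                                Prefix="config/app.ini")
# entry = prior_version(resp)
# if entry:
#     obj = s3.get_object(Bucket="my-example-bucket", Key=entry["Key"],
#                         VersionId=entry["VersionId"])
```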

&lt;p&gt;And one to grow on (for everyone): AWS recently released three new S3 training courses: &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/01/announcing-three-new-digital-courses-for-amazon-s3/"&gt;https://aws.amazon.com/about-aws/whats-new/2021/01/announcing-three-new-digital-courses-for-amazon-s3/&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>First Look: AWS Glue DataBrew</title>
      <dc:creator>Rich Dudley</dc:creator>
      <pubDate>Tue, 29 Dec 2020 02:47:03 +0000</pubDate>
      <link>https://forem.com/aws-builders/first-look-aws-glue-databrew-1phl</link>
      <guid>https://forem.com/aws-builders/first-look-aws-glue-databrew-1phl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This is a post about a new vendor service which blew up a blog series I had planned, and I'm not mad.  With a greater reliance on data science comes a greater emphasis on data engineering, and I had planned a blog series about building a pipeline with AWS services.  That all changed when AWS released DataBrew, which is a managed data profiling and preparation service.  The announcement is at &lt;a href="https://aws.amazon.com/blogs/aws/announcing-aws-glue-databrew-a-visual-data-preparation-tool-that-helps-you-clean-and-normalize-data-faster/"&gt;https://aws.amazon.com/blogs/aws/announcing-aws-glue-databrew-a-visual-data-preparation-tool-that-helps-you-clean-and-normalize-data-faster/&lt;/a&gt;, but the main thing to know is that DataBrew is a visual tool for analyzing and preparing datasets.  It's powerful without requiring a lot of programming. Despite its ease of use and numerous capabilities, DataBrew will not replace data engineers; instead, DataBrew will make it easier to set up and perform a great deal of the simple, rote data preparation activities, freeing data engineers to focus on the really hard problems.  We'll look into use cases and capabilities in future blog posts.  Spoiler alert: we're still going to need that pipeline I was going to write about, just more streamlined; I'll lay out the updated series in future posts.&lt;/p&gt;

&lt;p&gt;DataBrew is not a stand-alone service, but is instead a component of AWS Glue.  This makes sense, since it adds a lot of missing capabilities to Glue, and it can also take advantage of Glue's job scheduling and workflows.  Some of what I was planning to write involved Glue anyway, so this is convenient for me.&lt;/p&gt;

&lt;p&gt;In this "First Look" post I'm working my way through the DataBrew screens as you first encounter them, so if you have an AWS account, it might be useful to open DataBrew and move through the screens as you read.  No worries if you don't, I'll cover features more in-depth as I work through future posts. &lt;/p&gt;

&lt;h2&gt;
  
  
  DataBrew Overview
&lt;/h2&gt;

&lt;p&gt;There are four main parts of DataBrew: Datasets, Projects, Recipes and Jobs.  These are just where we start; there is a lot of ground to cover.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--emgJtwcq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22tr5ahx5gljrtwwusb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--emgJtwcq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/22tr5ahx5gljrtwwusb0.png" alt="DataBrew Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xNneL3r0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/37d6blbrse7ecygu3gk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xNneL3r0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/37d6blbrse7ecygu3gk1.png" alt="DataBrew datasets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Projects
&lt;/h3&gt;

&lt;p&gt;Holey moley there's a lot of stuff here!  The Projects screen is where the real action is, and we'll spend a lot of time here in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zEm7phBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iv79g5nx9uorp27qda4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zEm7phBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iv79g5nx9uorp27qda4u.png" alt="DataBrew Projects"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample View
&lt;/h4&gt;

&lt;p&gt;As we explore the Sample View, it's important to keep in mind that DataBrew is meant for actual data preparation work, not just lightweight profiles.  This sample view is kept to a small window so we can explore the effects of transformations and monitor their effects on quality.&lt;/p&gt;

&lt;p&gt;The majority of this page is taken up with a sample of the dataset and some lightweight profiling, including the type, number of unique values in the sample, the most common values in the sample, and the first few rows of the sample.  The sample size and position in the set can be changed.  This sample view is a great way to test transformations and enrichments, which we'll look into later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R3A14YgP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zsz9shkv086ssmspo43k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R3A14YgP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zsz9shkv086ssmspo43k.png" alt="DataBrew Sample"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profile view can be changed to explore the schema, which is inferred from CSV and JSON files, or taken from the metadata in Parquet files or the Glue Catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8dWmtuq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i9k217gnm4pgy8fvij7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8dWmtuq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/i9k217gnm4pgy8fvij7j.png" alt="DataBrew Schema"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third profile view is correlations and summaries.  If you've run several profiles, the history is available to browse.  The "missing cells" statistic is something we will revisit for the dataset I have loaded here.  Also, for my sample dataset, the correlation isn't that interesting because the majority of the columns are part of an address, so they should correlate.  But in other datasets, this could be really interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RRDeSFIP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/89qnlqpdx4gkrqrf2dnj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RRDeSFIP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/89qnlqpdx4gkrqrf2dnj.png" alt="DataBrew Profile Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profile view also has data insights into individual columns, showing several quality metrics for the selected column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2_JwC_uI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/whv7vowhk0oklangk3dx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2_JwC_uI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/whv7vowhk0oklangk3dx.png" alt="DataBrew Column Stats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformations
&lt;/h4&gt;

&lt;p&gt;DataBrew currently has over 250 built-in transformations, which AWS confusingly calls "&lt;a href="https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions-reference.html"&gt;Recipe actions&lt;/a&gt;" in parts of its documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--laJJkxgn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/268wmvtra6l9cxqr1qwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--laJJkxgn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/268wmvtra6l9cxqr1qwr.png" alt="DataBrew Transformations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The transformations are categorized in the menu bar above the profile grid.  Transformations include removing invalid values, removing nulls, flagging columns, replacing values, joins, aggregates, splits, and more.  Most of these should be familiar to a data professional.  With a join, you can enrich one dataset by joining it to other datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recipes
&lt;/h3&gt;

&lt;p&gt;When we're in the Projects tab, and we apply a transformation to a column, we're creating a recipe step.  One or more recipe steps form a recipe, and there isn't a published maximum number of recipes per dataset.  Since each recipe can be called by a separate job, this provides a great deal of flexibility in our data prep processes.  Recipe steps can only be edited on the Projects tab; the Recipes tab lists the existing recipes, allows for downloading of recipes and some other administrative tasks.  Recipes can be downloaded and published via the console or the CLI, providing a rudimentary sharing ability across datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H181F8Zs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/95udpe8r12b6heva83tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H181F8Zs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/95udpe8r12b6heva83tl.png" alt="DataBrew Recipes Tab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opening a recipe brings up summaries of the recipe's versions, and the other tab on this page opens up the data lineage for the recipe.  This lineage is not the data lineage through your enterprise, just the pathway through the recipe.  My simple example here isn't that impressive, but if you build a more complex flow with joins to other datasets and more recipes, this will be a nice view.  Although you can preview the datasets and recipes at the various steps, this is not a graphical workflow editor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kGP1IDhG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9k0f4bz3aqd2dlu7w5bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kGP1IDhG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9k0f4bz3aqd2dlu7w5bn.png" alt="DataBrew Lineage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is also a convenient screen to access CloudTrail logs for the recipes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jobs
&lt;/h3&gt;

&lt;p&gt;There are two types of jobs in DataBrew: "recipe" and "profile".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ly_WUvxV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ivhq4r1p555nbulupa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ly_WUvxV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ivhq4r1p555nbulupa9.png" alt="DataBrew Job Types"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A profile job examines up to 20,000 rows of data (more if you request an increase).  The results of a profiling job include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data type&lt;/li&gt;
&lt;li&gt;unique values count&lt;/li&gt;
&lt;li&gt;missing values count&lt;/li&gt;
&lt;li&gt;most common values and occurrences&lt;/li&gt;
&lt;li&gt;minimum and maximum values with occurrences&lt;/li&gt;
&lt;li&gt;percentiles, means, standard deviation, median&lt;/li&gt;
&lt;li&gt;and more...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One feature missing from Profiling is determining the pattern or length of text values.  The Profiling results are in JSON format, can be saved to S3, and there is an option to create a QuickSight dataset for reporting. Anything more than QuickSight will require some custom processing of the JSON output. Although it took this long in the post to discuss profiling jobs, a profile is something which really should be created before building recipes.&lt;/p&gt;
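
&lt;p&gt;That custom processing can be just a few lines.  The keys in this sketch are illustrative only; check an actual profile output file for the exact schema before relying on them:&lt;/p&gt;

```python
import json

def missing_cell_counts(profile_json):
    """Pull per-column missing-value counts out of a profile result.
    The "columns" and "missingValuesCount" keys are assumptions, not
    the documented schema."""
    doc = json.loads(profile_json)
    return {c["name"]: c.get("missingValuesCount", 0)
            for c in doc.get("columns", [])}

sample = json.dumps({"columns": [
    {"name": "zip", "missingValuesCount": 12},
    {"name": "state", "missingValuesCount": 0},
]})
counts = missing_cell_counts(sample)  # {"zip": 12, "state": 0}
```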

&lt;p&gt;A recipe job configures a published recipe to be run against a selected dataset.  In a Dataset job we choose the dataset, recipe and recipe version we want to use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ai5WKrQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c0c1dh86jr06l4a6cy32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ai5WKrQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c0c1dh86jr06l4a6cy32.png" alt="DataBrew Dataset Job"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other recipe job option is a Project job, which uses a saved project defined on the Projects tab.  In this job, the only thing we need to configure is the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k5Q9v9hU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ty6rq11899sw5mhmqvao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k5Q9v9hU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ty6rq11899sw5mhmqvao.png" alt="DataBrew Project Job"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The original dataset is not modified in DataBrew; instead, we configure the S3 location, output file format, and compression for storing the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uKcItV9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eecgxmj3otw2wpaqdlmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uKcItV9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eecgxmj3otw2wpaqdlmi.png" alt="DataBrew Output File Type"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0gStf09e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pqtrcuw1bdync124iksl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0gStf09e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pqtrcuw1bdync124iksl.png" alt="DataBrew Output Compression"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output can be partitioned on a particular column, and we can choose whether to overwrite the files from the previous run or keep each run's files.  Please use encryption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zGUwLb4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6dxh59ajsi8cuq5lzatx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zGUwLb4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6dxh59ajsi8cuq5lzatx.png" alt="DataBrew Output Partitioning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once configured, jobs can be scheduled.  You can have a maximum of two schedules per job.  If you need more than two schedules, you'll need to create an identical job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FpY_jHBE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4w5wuv0vokedaqle6zan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FpY_jHBE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4w5wuv0vokedaqle6zan.png" alt="DataBrew Job Schedule"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Either type of job can be run on a schedule, on-demand or as part of other workflows (see "Jobs Integrations" below).  There is only one recipe and one dataset per job, so processing multiple recipes and/or multiple datasets would require additional workflow.&lt;/p&gt;
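
&lt;p&gt;Scripted, the two-schedule limit looks something like this (the schedule names, cron expressions and job name are hypothetical, and the boto3 calls are commented out since they need credentials):&lt;/p&gt;

```python
# Two schedules for one job, the documented maximum; cron expressions use
# the same cron() syntax as EventBridge schedules.
schedules = [
    {"Name": "nightly-clean", "CronExpression": "cron(0 3 * * ? *)"},
    {"Name": "weekly-clean", "CronExpression": "cron(0 6 ? * MON *)"},
]

# import boto3
# databrew = boto3.client("databrew")
# for s in schedules:
#     databrew.create_schedule(Name=s["Name"],
#                              CronExpression=s["CronExpression"],
#                              JobNames=["clean-addresses"])
```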

&lt;h3&gt;
  
  
  Jobs Integrations
&lt;/h3&gt;

&lt;p&gt;Aside from the console or a schedule, how else can a DataBrew job be started?  For starters, the DataBrew API exposes all the functionality in the console, including running a job.  When coupled with Lambda functions, this provides a great deal of flexibility in starting a job.&lt;/p&gt;
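
&lt;p&gt;A minimal sketch of that Lambda approach (the job name is hypothetical, and the function's execution role would need permission to call databrew:StartJobRun):&lt;/p&gt;

```python
import json

def run_response(run):
    """Shape a Lambda proxy response from a start_job_run result."""
    return {"statusCode": 200, "body": json.dumps({"runId": run["RunId"]})}

def handler(event, context):
    # boto3 is bundled in the Lambda runtime; the import is deferred so
    # the module can be loaded without AWS access.
    import boto3
    databrew = boto3.client("databrew")
    return run_response(databrew.start_job_run(Name="clean-addresses"))
```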

&lt;p&gt;A second option is to use a Jupyter notebook (vanilla Jupyter, not SageMaker notebook yet) and the plugin found at &lt;a href="https://github.com/aws/aws-glue-databrew-jupyter-extension"&gt;https://github.com/aws/aws-glue-databrew-jupyter-extension&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Control Integration
&lt;/h3&gt;

&lt;p&gt;Recipes and jobs have a form of versioning, but there isn't a real source control workflow; instead, a new version is created with every published update, much like S3 object versioning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zjGT7uOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2pwmv5og9q8ym2a18evb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zjGT7uOS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2pwmv5og9q8ym2a18evb.png" alt="DataBrew Publish Recipe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta6_zGw_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e2okx18y8c6d1ddqslmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta6_zGw_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e2okx18y8c6d1ddqslmm.png" alt="DataBrew Recipe Versions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, as with most of AWS's online editors, there is no direct source control integration.  The best you can do is to download recipes and jobs as JSON and check them in manually.  Better than nothing but still surprising since AWS has CodeCommit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code
&lt;/h3&gt;

&lt;p&gt;At this time, neither Terraform nor Pulumi supports DataBrew, but CloudFormation can be used to script DataBrew; see &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_DataBrew.html"&gt;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_DataBrew.html&lt;/a&gt; for the API documentation and examples.  The CLI is another scripting option; its documentation is at &lt;a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/databrew/index.html"&gt;https://awscli.amazonaws.com/v2/documentation/api/latest/reference/databrew/index.html&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>glue</category>
      <category>databrew</category>
      <category>dataengineering</category>
      <category>etl</category>
    </item>
  </channel>
</rss>
