<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Preston</title>
    <description>The latest articles on Forem by John Preston (@johnpreston).</description>
    <link>https://forem.com/johnpreston</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F715442%2Fed830908-5e67-498f-9453-3449107ad66d.png</url>
      <title>Forem: John Preston</title>
      <link>https://forem.com/johnpreston</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/johnpreston"/>
    <language>en</language>
    <item>
      <title>Kafka Connect Watcher - actively monitor your clusters</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Fri, 23 Jun 2023 10:59:15 +0000</pubDate>
      <link>https://forem.com/aws-builders/kafka-connect-watcher-actively-monitor-your-clusters-35ib</link>
      <guid>https://forem.com/aws-builders/kafka-connect-watcher-actively-monitor-your-clusters-35ib</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/JohnPreston/kafka-connect-watcher"&gt;Kafka Connect Watcher&lt;/a&gt; is a service that will actively monitor your connect clusters &amp;amp; its connectors, allow you to define automated actions to take and notify you when problems occur&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Running Kafka Connect clusters is not particularly difficult, because the framework has been very well designed. It is orders of magnitude harder to manage the Kafka cluster(s) themselves. Furthermore, in 2023, there are several ways to deploy the workers &amp;amp; connectors: VMs, containers, and even managed services such as MSK Connect.&lt;/p&gt;

&lt;p&gt;I love managed services, so I eventually looked into MSK Connect. However, at the time of writing, MSK Connect is A) very expensive and B) very limited in connection options.&lt;/p&gt;

&lt;p&gt;"But mate, if it's easy to run a cluster, what's the issue?" I hear you ask. Well, let's dive into it, shall we?&lt;/p&gt;

&lt;h2&gt;
  
  
  The why
&lt;/h2&gt;

&lt;p&gt;Let's start with a little bit of background: over the past few years, I have been responsible for deploying and maintaining several connect clusters, and have implemented CI/CD pipelines for developers to deploy their connectors.&lt;/p&gt;

&lt;p&gt;Running the connect cluster has an infrastructure cost: a few containers on ECS + Fargate (not going to get into Kafka costs, that's another issue on its own), and really that's where the complications stop on the cluster side. A connect cluster will run without requiring any maintenance.&lt;/p&gt;

&lt;p&gt;Connectors, however, are a different story. Especially if, like me, to reduce cost and avoid re-inventing the wheel, you come up with the idea of hosting a connect cluster and offering "connectors as a service" to your application teams.&lt;/p&gt;

&lt;p&gt;The Connect framework on its own offers, to my knowledge, only one way to extract monitoring metrics on the health of the connectors: the JMX exporter. If you refer to &lt;a href=""&gt;this blog post&lt;/a&gt;, which details how I achieved this, you will see that there is, technically, an easy way to collect data points on the health of your connectors.&lt;/p&gt;

&lt;p&gt;However, this won't give you information such as why a connector failed. And unless you go down the rabbit hole of rather involved logic, getting this information can be time-consuming.&lt;/p&gt;

&lt;p&gt;And sometimes, all it takes for a connector to work again is to pause it, restart its tasks, and resume it.&lt;/p&gt;

&lt;p&gt;So, to solve these problems, I decided to write Kafka Connect Watcher.&lt;/p&gt;

&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;Originally, I thought that, as a savvy AWS engineer, I would use my existing CloudWatch metrics, notify a Lambda function, parse the alarm payload, and from there attempt to connect to the Connect cluster. But then again, there is a lot of information you'd need that the CloudWatch alarm does not give you.&lt;/p&gt;

&lt;p&gt;Some Kafka vendors will tie you into their ecosystem tooling (yes, that includes AWS). It's great if you are buying into it, or into their supported partners, but the action options are limited and up to you to implement.&lt;/p&gt;

&lt;p&gt;And so, that was the important thing to me: Kafka Connect is an open-source framework that works wherever you want it to run. So in that spirit, I decided instead to write a simple micro-service, in Python, for anyone to use and contribute to.&lt;/p&gt;

&lt;p&gt;Now, as you have guessed already, this service is AWS-integrated: as it currently stands, it will send notifications to SNS and report metrics to CloudWatch, but there will definitely be room for other integrations (webhooks are very versatile!).&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;Why in Python, I hear you ask? Well, I had already written &lt;a href="https://pypi.org/project/kafka-connect-api/"&gt;a Python client library for Kafka Connect&lt;/a&gt; to enable automation, so I decided to re-use it.&lt;/p&gt;

&lt;p&gt;The service takes a configuration file that details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the connect clusters to monitor

&lt;ul&gt;
&lt;li&gt;the connectors to monitor, using regular expressions to include/exclude specific connectors&lt;/li&gt;
&lt;li&gt;the actions to perform when a connector is not &lt;code&gt;RUNNING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the notifications to send when a connector activity occurs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The monitoring to enable and its respective settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The connect watcher can monitor more than one connect cluster at a time, so as to save on deployments, if your architecture and configuration allow for it.&lt;/p&gt;

&lt;p&gt;The configuration file uses a JSON schema for both input validation and help with documentation of the required fields, which I hope will help users.&lt;/p&gt;
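&lt;p&gt;For illustration only, a configuration file could look along these lines. The field names here are hypothetical, made up for the example; the JSON schema shipped with the project documents the actual ones.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: check the project's JSON schema for real field names
clusters:
  production-connect:
    hostname: connect.cluster.internal
    port: 8083
    connectors:
      include_regex: "^(sink|source)-.*"
      exclude_regex: ".*-test$"
    actions_on_failure:
      - pause
      - restart-tasks
      - resume
    notifications:
      sns_topic_arn: arn:aws:sns:eu-west-1:123456789012:connect-alerts
&lt;/code&gt;&lt;/pre&gt;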

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is a fun project which I enjoy working on, and I will happily implement new features, so please feel free to open a &lt;a href="https://github.com/JohnPreston/kafka-connect-watcher/issues"&gt;Feature Request&lt;/a&gt; or submit your code changes.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kafka</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Reduce costs &amp; Improve peak performance with AWS Application Autoscaling scheduled actions</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Wed, 21 Jun 2023 09:48:07 +0000</pubDate>
      <link>https://forem.com/aws-builders/reduce-costs-improve-peak-performance-with-aws-application-autoscaling-scheduled-actions-11p6</link>
      <guid>https://forem.com/aws-builders/reduce-costs-improve-peak-performance-with-aws-application-autoscaling-scheduled-actions-11p6</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Using AWS Application Autoscaling scheduled actions, you can prepare your resources' compute capacity based on your needs, in combination with normal scaling rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The cloud, a mystical place where you pay for what you need and no more. One of my favorite principles of AWS is that they give you all the tools, in the form of APIs and services, to be smart about spending your money.&lt;/p&gt;

&lt;p&gt;You have probably seen, many times already, customer demos showing how they handle peak times with fleets of machines when they are needed, and scale in to a minimum when usage is at its lowest.&lt;/p&gt;

&lt;p&gt;Over the years, the features enabling customers to do these things have continuously gotten better. And many customers take great advantage of features such as Spot Fleet, as the majority of use-cases still require running EC2 instances.&lt;/p&gt;

&lt;p&gt;But what about what's &lt;strong&gt;not&lt;/strong&gt; running on EC2?&lt;/p&gt;

&lt;h2&gt;
  
  
  Use-case
&lt;/h2&gt;

&lt;p&gt;In the environments I work on, applications are deployed using AWS ECS on top of AWS Fargate. No more EC2 to manage, great, right? What about scaling DynamoDB? ElastiCache? Aurora?&lt;/p&gt;

&lt;p&gt;Many of these (expensive) services have a feature allowing you to define scaling rules on one (or more) dimensions. For example, with AWS ECS, it's the service &lt;code&gt;DesiredCount&lt;/code&gt; that changes (the number of containers). For DynamoDB tables, it's the table's Read &amp;amp; Write capacity units, which can also apply to all of the table's indexes.&lt;/p&gt;

&lt;p&gt;Each combination of resource and dimension constitutes a &lt;strong&gt;Scaling Target&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;How does it all work? It relies on a service that is a lot less known: AWS Application Autoscaling. This service is responsible for monitoring alarms &amp;amp; metrics, and evaluating the value(s) against the different rules you might have set in place.&lt;/p&gt;

&lt;p&gt;Usually, the rules that are created will scale the dimension(s) in/out (or up/down, depending on the resource) based on current usage. Think of the usual average CPU across your ECS containers, for example: you want the average maintained below 70%, and you have given a range (min and max) for the number of containers in the service.&lt;/p&gt;

&lt;p&gt;And on a daily basis, these rules are applied very accurately by the ever-watching Application Autoscaling service.&lt;/p&gt;

&lt;p&gt;But what about predictable workloads?&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduled Actions - the cron scheduler of autoscaling
&lt;/h2&gt;

&lt;p&gt;All resource dimensions allow you to define a Min and a Max capacity. For example, min=1,max=10 Write Capacity Units (WCU) &amp;amp; min=10,max=20 Read Capacity Units (RCU) for a DynamoDB table.&lt;/p&gt;

&lt;p&gt;What scheduled actions allow you to do is define a rule (your usual cron, a rate, e.g. every 2 hours, or a specific point in time, but more on that one later) that will change &lt;strong&gt;the min/max dimensions of your resource&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Yes, you read that right: it does not actually change the value of, say, the number of ECS containers, but the min and max of the range for that Scaling Target.&lt;/p&gt;

&lt;p&gt;Now of course, if you change the minimum (say our RCU min = 15) while the current value is at 10, then the RCU will become 15, as the current value has to be within the range.&lt;/p&gt;

&lt;p&gt;It works the other way around with the maximum, of course. If you had 700 RCUs and the scheduled action changes the range to max=100, the current value will come down to 100.&lt;/p&gt;
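&lt;p&gt;This clamping behaviour can be sketched in a few lines of Python. This is only an illustration of the semantics, not Application Autoscaling's actual implementation:&lt;/p&gt;

```python
def apply_scheduled_action(current, new_min=None, new_max=None):
    """Capacity after a scheduled action updates the min/max range.

    The action changes the range, not the value itself: the current
    value only moves if it falls outside the new range.
    """
    if new_min is not None and current < new_min:
        current = new_min
    if new_max is not None and current > new_max:
        current = new_max
    return current

# RCU at 10, action raises min to 15 -> value becomes 15
print(apply_scheduled_action(10, new_min=15))    # 15
# 700 RCUs, action lowers max to 100 -> value drops to 100
print(apply_scheduled_action(700, new_max=100))  # 100
# 60 containers, min restored to 10 -> count stays at 60
print(apply_scheduled_action(60, new_min=10))    # 60
```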

&lt;p&gt;If you have a website (news, shopping etc.) and you have put in place the analytics to understand when your user base connects to your site, you can then create scheduled actions which can predictably set your infrastructure up when you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Application Autoscaling is very smart
&lt;/h2&gt;

&lt;p&gt;Say your resource already has a scaling rule based on site usage, for example average latency in milliseconds. The way to maintain latency at a low value is to add more ECS containers.&lt;/p&gt;

&lt;p&gt;What's going to happen in the real world with scheduled rules?&lt;/p&gt;

&lt;p&gt;Let's take the example of a news website that, every day at 6AM, starts to see its traffic go up (via the metric above). Say that metric reports 50ms at 5:45AM. There is a rule that says: above 100ms, double the capacity. That rule works regardless of the number of containers, and we have defined that we can have between 10 and 100 containers.&lt;/p&gt;

&lt;p&gt;Without scheduled actions, we would have to wait for the latency to go above 100ms to trigger the scale-out rule. The latency then goes down for a while; we now have 20 containers. Let's assume it goes up again: we now have 40 containers.&lt;/p&gt;

&lt;p&gt;This will go on for as long as the latency keeps going up. When the latency goes down, we remove containers, of course.&lt;/p&gt;

&lt;p&gt;But customer satisfaction these days relies in large part on having quick and responsive websites. So it is worth it for the business, and they decide that, at a minimum, there should be 30 containers from 5:45AM to 7AM.&lt;/p&gt;

&lt;p&gt;We create a first scheduled action to set min = 30 at 5:45AM,&lt;br&gt;
then a second one restoring min = 10 at 7AM.&lt;/p&gt;
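&lt;p&gt;For illustration, here is what those two actions could look like with the AWS CLI. The cluster and service names are made up for the example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;aws application-autoscaling put-scheduled-action \
    --service-namespace ecs \
    --resource-id service/news-cluster/frontend \
    --scalable-dimension ecs:service:DesiredCount \
    --scheduled-action-name morning-ramp-up \
    --schedule "cron(45 5 * * ? *)" \
    --scalable-target-action MinCapacity=30

aws application-autoscaling put-scheduled-action \
    --service-namespace ecs \
    --resource-id service/news-cluster/frontend \
    --scalable-dimension ecs:service:DesiredCount \
    --scheduled-action-name morning-ramp-down \
    --schedule "cron(0 7 * * ? *)" \
    --scalable-target-action MinCapacity=10
&lt;/code&gt;&lt;/pre&gt;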

&lt;p&gt;From 5.45AM to 7AM we are guaranteed that 30 containers are running. However, that doesn't mean that latency won't go up.&lt;/p&gt;

&lt;p&gt;Let's assume it does, at 6:45AM, and so we now have 60 containers running.&lt;/p&gt;

&lt;p&gt;Come 7AM, the rule &lt;strong&gt;will only change the minimum&lt;/strong&gt;, not the count. So at 7AM we still have 60 containers, and possibly more afterwards, if the latency continues to increase.&lt;/p&gt;

&lt;p&gt;What this guarantees us is: whenever the latency goes down after 7AM (say below 30ms), we don't need as many containers anymore. Before 7AM, we would have no fewer than 30 containers, no matter the value of the latency. After 7AM, if the latency allows for it, autoscaling will bring the count down towards the minimum, eventually 10.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;For many services, the console will offer to define scaling rules based on a given input metric. However, there is no UI to configure scheduled actions, which in turn makes this feature not very well known, yet it is a great one to take advantage of. Equally, until recently (see &lt;a href="https://dev.to/aws-builders/journey-of-creating-a-new-aws-cloudformation-resource-3483"&gt;my previous post&lt;/a&gt;), there was no way to create these rules via AWS CloudFormation.&lt;/p&gt;

&lt;p&gt;But this feature is available today, and you should definitely use it as soon as you can identify patterns that will allow you to save money ("I don't need these overnight"), improve performance (repetitive/predictable behaviour), or both!&lt;/p&gt;

&lt;p&gt;Head over to the &lt;a href="https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html"&gt;documentation&lt;/a&gt; for more information.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>costsaving</category>
      <category>infrastructureascode</category>
    </item>
    <item>
      <title>Journey of creating a new AWS CloudFormation resource</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Fri, 09 Jun 2023 06:56:24 +0000</pubDate>
      <link>https://forem.com/aws-builders/journey-of-creating-a-new-aws-cloudformation-resource-3483</link>
      <guid>https://forem.com/aws-builders/journey-of-creating-a-new-aws-cloudformation-resource-3483</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I created a new CloudFormation resource in the &lt;code&gt;AwsCommunity&lt;/code&gt; organization with Python, created a new &lt;a href="https://github.com/cloudtools/troposphere"&gt;Troposphere&lt;/a&gt; resource for it, and now can use it in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The why
&lt;/h2&gt;

&lt;p&gt;AWS CloudFormation (CFN) is my favorite Infrastructure as Code (IaC) service to use with AWS. However, some resources might not have a CloudFormation equivalent created by the respective product team, for various possible reasons.&lt;/p&gt;

&lt;p&gt;So then, you have to either use another IaC tool (e.g. Ansible), create a CustomResource (a Lambda function invoked by CFN), or create your own AWS resource and publish it for everyone else on AWS to use, via the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/registry.html"&gt;AWS CloudFormation Registry&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I wanted the ability to create &lt;a href="https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html"&gt;ScheduledActions&lt;/a&gt;, resources managed by the &lt;a href="https://docs.aws.amazon.com/autoscaling/application/userguide/what-is-application-auto-scaling.html"&gt;ApplicationAutoscaling&lt;/a&gt; service, but there wasn't a resource for that. So I created it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The how
&lt;/h2&gt;

&lt;p&gt;The CFN team has created a set of tools and libraries in different languages to allow people to author their own resources.&lt;/p&gt;

&lt;p&gt;I had created a few myself before, and published them out of my own AWS accounts. But this time, instead of doing it all on my own, I had the opportunity to contribute to the &lt;a href="https://github.com/aws-cloudformation/community-registry-extensions"&gt;Open Source, community-driven repository&lt;/a&gt; on GitHub: a great place to see what already existed, and to have other people actually help out and review my code.&lt;/p&gt;

&lt;p&gt;If you are used to the workflows of CloudFormation resource provisioning and the different stages a resource goes through, this is a very painless process. And to make the whole process even less painful, you can (and definitely should) run contract testing locally using the &lt;code&gt;sam cli&lt;/code&gt;, which will validate that your code passes all the different use-cases it should.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publishing your resource
&lt;/h2&gt;

&lt;p&gt;As mentioned, for a seasoned CloudFormation user, creating a CFN template to register my resource and a StackSet to publish it in all AWS regions isn't difficult, but it can be very daunting.&lt;/p&gt;

&lt;p&gt;Joining the AWS community-driven repository removes that concern altogether, as the friendly &amp;amp; helpful team at AWS has automated the whole process for you, and will publish your resource as part of the &lt;code&gt;AwsCommunity&lt;/code&gt; "Organization".&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the resource
&lt;/h2&gt;

&lt;p&gt;Once the resource is published on AWS, you need only to activate it in your account. This will require you to create an IAM role to allow the resource to perform actions on your behalf. You can also, optionally, have logging output sent to your AWS Account, which can help with your audits and security posture.&lt;/p&gt;

&lt;p&gt;After you have activated the resource, you can start using it in your CloudFormation templates like any other resource. The resource returns attributes, so you can use intrinsic functions such as Fn::GetAtt, Ref, or Fn::Sub on it.&lt;/p&gt;

&lt;p&gt;Here is a snippet of how the resource would be used in a CloudFormation template&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2010-09-09"&lt;/span&gt;
&lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Template to test AwsCommunity::ApplicationAutoscaling::ScheduledAction&lt;/span&gt;

&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ScheduledActionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cfn-testing-resource&lt;/span&gt;

  &lt;span class="na"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;AllowedValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elasticmapreduce"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appstream"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamodb"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom-resource"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comprehend"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cassandra"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elasticache"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neptune"&lt;/span&gt;
    &lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;ScalableDimension&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;AllowedValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs:service:DesiredCount"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2:spot-fleet-request:TargetCapacity"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elasticmapreduce:instancegroup:InstanceCount"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appstream:fleet:DesiredCapacity"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamodb:table:ReadCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamodb:table:WriteCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamodb:index:ReadCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dynamodb:index:WriteCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds:cluster:ReadReplicaCount"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sagemaker:variant:DesiredInstanceCount"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom-resource:ResourceType:Property"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comprehend:document-classifier-endpoint:DesiredInferenceUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comprehend:entity-recognizer-endpoint:DesiredInferenceUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda:function:ProvisionedConcurrency"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cassandra:table:ReadCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cassandra:table:WriteCapacityUnits"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka:broker-storage:VolumeSize"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elasticache:replication-group:NodeGroups"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elasticache:replication-group:Replicas"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neptune:cluster:ReadReplicaCount"&lt;/span&gt;
    &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;MinCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Number&lt;/span&gt;
    &lt;span class="na"&gt;MinValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

  &lt;span class="na"&gt;MaxCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Number&lt;/span&gt;
    &lt;span class="na"&gt;MinValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

  &lt;span class="na"&gt;ScheduleExpression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;the cron(), rate() or at() expression.&lt;/span&gt;

  &lt;span class="na"&gt;EndTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;When using cron() or rate(), timestamp of when the rule expires.&lt;/span&gt;

  &lt;span class="na"&gt;StartTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;When using cron() or rate(), timestamp of when the rule starts.&lt;/span&gt;

  &lt;span class="na"&gt;TimeZone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC"&lt;/span&gt;

  &lt;span class="na"&gt;ResourceId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The Scalable Target ID.&lt;/span&gt;

&lt;span class="na"&gt;Conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;NoEndTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;!Equals&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="nv"&gt;EndTime&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none"&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;NoStartTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;!Equals&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="nv"&gt;StartTime&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none"&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ScheduledActionForScalableTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AwsCommunity::ApplicationAutoscaling::ScheduledAction&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ScheduledActionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ScheduledActionName&lt;/span&gt;
      &lt;span class="na"&gt;ServiceNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ServiceNamespace&lt;/span&gt;
      &lt;span class="na"&gt;ScalableDimension&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ScalableDimension&lt;/span&gt;
      &lt;span class="na"&gt;Schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ScheduleExpression&lt;/span&gt;
      &lt;span class="na"&gt;ScalableTargetAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;MinCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;MinCapacity&lt;/span&gt;
        &lt;span class="na"&gt;MaxCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;MaxCapacity&lt;/span&gt;
      &lt;span class="na"&gt;StartTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!If&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NoStartTime&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;AWS::NoValue&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;StartTime&lt;/span&gt;
      &lt;span class="na"&gt;EndTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!If&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NoEndTime&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;AWS::NoValue&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;StartTime&lt;/span&gt;
      &lt;span class="na"&gt;Timezone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;TimeZone&lt;/span&gt;
      &lt;span class="na"&gt;ResourceId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;ResourceId&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a troposphere resource and integration to ECS Compose-X
&lt;/h2&gt;

&lt;p&gt;If you have read my earlier posts or heard me talk with &lt;a href="https://twitter.com/QuinnyPig"&gt;Corey&lt;/a&gt; on &lt;a href="https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/clouddev-for-retail-companies-with-john-mille/"&gt;Screaming in the Cloud&lt;/a&gt;, you may know that I am the author of a tool called &lt;a href="https://github.com/compose-x/ecs_composex"&gt;ECS Compose-X&lt;/a&gt;. It is designed to support all the docker-compose features, and allows for extensions that make it easier to deploy to AWS.&lt;/p&gt;

&lt;p&gt;Because ECS Compose-X uses &lt;a href="https://github.com/cloudtools/troposphere"&gt;Troposphere&lt;/a&gt;, I was able to create a very light and simple Python library (&lt;a href="https://github.com/JohnPreston/troposphere-awscommunity-applicationautoscaling-scheduledaction"&gt;https://github.com/JohnPreston/troposphere-awscommunity-applicationautoscaling-scheduledaction&lt;/a&gt;) to distribute the resource for other Troposphere users to re-use.&lt;/p&gt;

&lt;p&gt;As autoscaling was already implemented in the project, this was a natural addition to the &lt;a href="https://docs.compose-x.io/syntax/compose_x/ecs.details/scaling.html"&gt;x-scaling&lt;/a&gt; services feature. This particular resource allows you to schedule scaling your ECS service up when you need to, and back down accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Creating an AWS resource for CloudFormation might seem like something you shouldn't have to do, but I see it as a great way to contribute back to the community and to a service I love. Doing so, I hope, will help people who might be trying to achieve the same thing.&lt;/p&gt;

&lt;p&gt;And it is not limited to AWS resources: MongoDB, New Relic, Datadog and other AWS partners have taken advantage of this capability to publish their own resources, allowing AWS customers to rely on AWS CloudFormation to provision them.&lt;/p&gt;

&lt;p&gt;I am not a CDK user, but one could now also create a CDK construct to provision that resource in their code.&lt;/p&gt;

&lt;p&gt;Big shoutout to the AWS CloudFormation team (@ericzbeard, @kddejong) &amp;amp; contributors for having helped me out. I look forward to seeing more exciting resources and projects coming up!&lt;/p&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>Deploy Conduktor &amp; a MSK Cluster in 3 commands</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Wed, 25 Jan 2023 09:42:08 +0000</pubDate>
      <link>https://forem.com/aws-builders/deploy-conduktor-a-msk-cluster-in-3-commands-2fhl</link>
      <guid>https://forem.com/aws-builders/deploy-conduktor-a-msk-cluster-in-3-commands-2fhl</guid>
      <description>&lt;p&gt;Full blog post available at the &lt;a href="https://labs.compose-x.io/kafka/conduktor_msk_demo.html" rel="noopener noreferrer"&gt;Compose-X Labs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use ECS Compose-X and the new &lt;code&gt;x-msk_cluster&lt;/code&gt; extension to create a new cluster or use an existing one, and connect your ECS services to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the demo in 3 commands
&lt;/h3&gt;

&lt;p&gt;Install compose-x &amp;amp; the MSK extension&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; ecs-composex-msk-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download the compose file for the demo&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://raw.githubusercontent.com/compose-x/ecs_composex-msk_cluster/main/use-cases/conduktor.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy to AWS with ecs-compose-x&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ecs-compose-x init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
ecs-compose-x up &lt;span class="nt"&gt;-d&lt;/span&gt; templates &lt;span class="nt"&gt;-p&lt;/span&gt; msk-conduktor &lt;span class="nt"&gt;-f&lt;/span&gt; conduktor.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This blog post uses &lt;a href="https://conduktor.io" rel="noopener noreferrer"&gt;Conduktor Platform&lt;/a&gt; as the example service, given it has native IAM integration with AWS MSK and allows for testing connectivity &amp;amp; access to the Kafka cluster.&lt;/p&gt;

&lt;p&gt;You can log into the Conduktor platform using the username and password defined in &lt;code&gt;conduktor.yaml&lt;/code&gt;, and connect to &lt;code&gt;http://&amp;lt;ecs_task_public_ip&amp;gt;:8080/&lt;/code&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS MSK, Confluent Cloud, Aiven. How to choose your managed Kafka service provider?</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Tue, 22 Nov 2022 16:28:47 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-msk-confluent-cloud-aiven-how-to-chose-your-managed-kafka-service-provider-15m0</link>
      <guid>https://forem.com/aws-builders/aws-msk-confluent-cloud-aiven-how-to-chose-your-managed-kafka-service-provider-15m0</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;This blog post provides an overview of different managed Kafka service providers, including AWS MSK, Confluent Cloud, and Aiven. It compares their features, including cost, operational capabilities, and security, to help you decide which provider is best suited to your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  A little background.
&lt;/h2&gt;

&lt;p&gt;I am by no means a Kafka guru: I haven't contributed to it, and I hold no certification or affiliation. All I am is a "power user" who has been using AWS for years and has spent the past few years working with a managed Kafka service provider, which now gives me plenty to compare practically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions/Offering comparison
&lt;/h2&gt;

&lt;p&gt;In today's comparison, I am going to use &lt;a href="https://aws.amazon.com/msk/"&gt;AWS MSK&lt;/a&gt; and Confluent Cloud.&lt;/p&gt;

&lt;p&gt;At times, I will also mention &lt;a href="https://aiven.io/"&gt;Aiven&lt;/a&gt;, but my credits ran out before I could explore all of its features, so I recommend you explore that option yourselves too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/msk/features/msk-serverless/"&gt;MSK serverless&lt;/a&gt; being new and of limited use-cases, due to its limitations by nature, I am leaving out of this comparison. Equally not considering Confluent "Platform" out of this comparison, given it's not a managed service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;This is, to me, the first and most important criterion. As Kafka becomes more and more popular, it is crucial to ensure that the information is secured and access is restricted.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AWS MSK&lt;/th&gt;
&lt;th&gt;Confluent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Encryption at rest&lt;/td&gt;
&lt;td&gt;* Default AWS encryption key&lt;br&gt;&lt;br&gt;* Use Customer encryption key (CMK)&lt;/td&gt;
&lt;td&gt;* Can use KMS key, at premium cost&lt;br&gt;&lt;br&gt;* No details on default encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authentication Methods&lt;/td&gt;
&lt;td&gt;* SASL with PLAIN/SCRAM/IAM&lt;br&gt;&lt;br&gt;* TLS/SSL&lt;/td&gt;
&lt;td&gt;SASL PLAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audits &amp;amp; Broker logs&lt;/td&gt;
&lt;td&gt;* Full audits with IAM&lt;br&gt;&lt;br&gt;&lt;br&gt;* No audits without IAM, rely on broker logs&lt;br&gt;* Broker logs available, long term persistence&lt;/td&gt;
&lt;td&gt;* Audit logs for Kafka actions. Requires efforts to query&lt;br&gt;&lt;br&gt;* No access to broker logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Encryption at-rest
&lt;/h3&gt;

&lt;p&gt;AWS MSK, Aiven, and Confluent Cloud all support encryption in transit and at rest. AWS MSK allows you to use your own AWS KMS key (or CMK) to encrypt the cluster data at rest, with no restriction on compute size (MSK Serverless, however, does not allow you to set that up).&lt;/p&gt;

&lt;p&gt;Aiven does not seem to have an option (at least, not as per their console/wizard) to import your encryption key, regardless of the compute tier you select.&lt;/p&gt;

&lt;p&gt;Confluent offers this option, but only at a premium: you must choose the most expensive compute option to support importing an encryption key. At the time of writing, that option is only available to AWS customers.&lt;/p&gt;

&lt;p&gt;But the permissions that Confluent requires you to grant on your CMK are so wide open that technically they could be using it for anything they like. When asked to list the services leveraging the key, no answer was provided.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka authentication methods
&lt;/h3&gt;

&lt;p&gt;Kafka allows for different authentication methods, each with its pros &amp;amp; cons. We won't get into that here; there is plenty of material out there that explains it in more detail than I could.&lt;/p&gt;

&lt;p&gt;In a nutshell, Apache Kafka natively supports the following with &lt;a href="https://docs.confluent.io/platform/current/kafka/overview-authentication-methods.html#sasl"&gt;SASL&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PLAIN (username/password)&lt;/li&gt;
&lt;li&gt;SCRAM (username/password)&lt;/li&gt;
&lt;li&gt;OAUTH (OAuth2)&lt;/li&gt;
&lt;li&gt;GSSAPI (Kerberos)&lt;/li&gt;
&lt;li&gt;LDAP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Kafka also supports &lt;a href="https://docs.confluent.io/platform/current/kafka/overview-authentication-methods.html#mtls"&gt;mutual TLS&lt;/a&gt; (or mTLS), which uses certificate-based authentication.&lt;/p&gt;

&lt;p&gt;With regards to authorization (what a given client can or can't do), Apache Kafka supports setting ACLs to grant selective permissions to each user. You have to use tools such as JulieOps or &lt;a href="https://github.com/compose-x/cfn-kafka-admin"&gt;CFN Kafka Admin&lt;/a&gt;, or just the Kafka CLI/Admin API, to set these permissions.&lt;/p&gt;
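&lt;p&gt;As an illustration of the Kafka CLI route, something along these lines grants a principal read access to a topic and its consumer group. All names below are placeholders, and the command assumes a reachable cluster and a valid client configuration:&lt;/p&gt;

```shell
# Grant a principal read access to a topic and its consumer group,
# using the ACL CLI bundled with Apache Kafka.
# broker address, principal, topic, and group names are placeholders.
kafka-acls.sh --bootstrap-server broker-1:9092 \
  --command-config client.properties \
  --add --allow-principal User:orders-service \
  --operation Read \
  --topic orders \
  --group orders-consumers
```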

&lt;p&gt;Confluent Cloud only supports SASL_SSL with PLAIN (username/password) authentication.&lt;br&gt;
Their concept of service accounts makes access management easy across multiple clusters.&lt;br&gt;
But in the past year, the information provided to you via API/CLI &lt;strong&gt;breaks native Apache Kafka&lt;/strong&gt; compatibility: the principal given for ACLs is not a valid one. You must therefore request or query the correct Kafka user principal to get things working.&lt;/p&gt;

&lt;p&gt;Confluent also has its own "Roles"/RBAC-driven access control layer, which is an attempt at making the management of said ACLs user-friendly.&lt;/p&gt;

&lt;p&gt;AWS MSK supports more authentication methods than Confluent Cloud. It also implemented an IAM Native SASL mechanism, allowing you to use IAM credentials (Access/Secret Keys &amp;amp; IAM role-based tokens, etc.) to authenticate.&lt;/p&gt;

&lt;p&gt;MSK goes even further, as you can also define ACLs via setting IAM policies that grant the users access to resources (topics, groups, etc.).&lt;br&gt;
You do not need any additional tooling to provide your clients access to Kafka. AWS MSK with IAM provides you with fine-grain auditability as you can log these calls into AWS Cloud Trail.&lt;/p&gt;
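&lt;p&gt;To give a rough idea of what that looks like in practice, here is a minimal sketch of an IAM policy document granting one application read access to a single topic and group. The region, account ID, cluster, topic and group names are hypothetical placeholders; refer to the AWS MSK IAM access control documentation for the exact action and resource semantics:&lt;/p&gt;

```python
import json

# Sketch of an IAM policy for MSK IAM authentication: one consumer,
# one topic, one consumer group. All ARN components are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Every client needs permission to connect to the cluster.
            "Effect": "Allow",
            "Action": ["kafka-cluster:Connect"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:cluster/demo/*",
        },
        {
            # Read access to a single topic.
            "Effect": "Allow",
            "Action": ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:topic/demo/*/orders",
        },
        {
            # Consumer-group membership for the application's groups.
            "Effect": "Allow",
            "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:group/demo/*/orders-*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Attach such a policy to the application's IAM role and every `kafka-cluster:*` call it makes is both enforced and visible in CloudTrail.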




&lt;p&gt;It is worth noting that MSK with IAM is very useful and powerful, but AWS needs to keep in mind that they must support Apache Kafka native authentication methods in their other service offerings.&lt;/p&gt;




&lt;h3&gt;
  
  
  Audits
&lt;/h3&gt;

&lt;p&gt;I haven't been able to evaluate that capability with Aiven; yet again, I could not find any option in their "UI" provisioning to configure it.&lt;/p&gt;

&lt;p&gt;Confluent Cloud has some audits, but these are provided to you in the form of a Kafka topic to which they publish Kafka action events. You cannot specify where this audit topic is located. Because the logs are in a topic, you have to retrieve/export the data yourself into a data store to query these events intelligently. I have an &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/mkc-S3sink-connector-example.html"&gt;S3 Sink&lt;/a&gt; connector which stores the data in S3, and I use Athena to query the logs.&lt;br&gt;
Confluent does not provide you with a schema for the data, so I had to figure that out myself to make intelligent queries on fields possible.&lt;/p&gt;

&lt;p&gt;As mentioned above, MSK provides that audit capability natively when using IAM to authenticate, but for other authentication methods, you will have to rely on the broker logs.&lt;/p&gt;

&lt;p&gt;Speaking of broker logs, Confluent simply does not share these or make them available to you, period. That makes troubleshooting very frustrating. I also see it as a means for them to do all sorts of operations and changes without you having any visibility over them.&lt;/p&gt;

&lt;p&gt;AWS MSK lets you store broker logs in 3 different destinations: CloudWatch Logs, Kinesis Data Firehose, and S3. Each has its pros and cons, but ultimately, the option is there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational Capabilities
&lt;/h2&gt;

&lt;p&gt;On security alone, I already have my preference. But let's look at another aspect that these days you simply cannot do without: operability - at least that's what I call it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AWS MSK&lt;/th&gt;
&lt;th&gt;Confluent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kafka Version&lt;/td&gt;
&lt;td&gt;Can be selected by user&lt;/td&gt;
&lt;td&gt;Selected by Confluent, no options.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure as Code &amp;amp; API.&lt;/td&gt;
&lt;td&gt;* Full AWS API to CRUD resources&lt;br&gt;* SDK support for multiple languages&lt;/td&gt;
&lt;td&gt;* API without OpenAPI spec&lt;br&gt;* No Confluent maintained SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;* Full monitoring of brokers&lt;br&gt;* Auto-generated clients metrics&lt;br&gt;* Open Monitoring with Prometheus &amp;amp; JMX&lt;/td&gt;
&lt;td&gt;* High level cluster metrics&lt;br&gt;* Heavily rate limited (80 calls/h)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network availability&lt;/td&gt;
&lt;td&gt;* Private &amp;amp; Public access&lt;/td&gt;
&lt;td&gt;* Private &amp;amp; Public access&lt;br&gt;&lt;br&gt;Limitations on options when using private networking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Kafka version
&lt;/h3&gt;

&lt;p&gt;With Confluent Cloud, you cannot choose. They pick the version, run it for you, and all you get to know is the compatibility level.&lt;br&gt;
According to their website, they run the same version as what's available in "Confluent Platform".&lt;/p&gt;

&lt;p&gt;With AWS MSK, you get to choose which version you use. In a way, that makes you responsible for choosing said version and knowing how it differs from others. But equally, if you were migrating from a self-hosted cluster, for example, it allows you to ensure compatibility stays the same for your clients, limiting risks.&lt;/p&gt;

&lt;p&gt;Some versions give you access to additional features, such as the "2.8.2_tiered" version, which allows you to configure &lt;a href="https://aws.amazon.com/blogs/big-data/retain-more-for-less-with-tiered-storage-for-amazon-msk/"&gt;tiered storage&lt;/a&gt; for additional cost savings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code &amp;amp; API.
&lt;/h3&gt;

&lt;p&gt;As most vendors do these days, they have and maintain a Terraform provider for these. AWS MSK also has CloudFormation support (and therefore CDK/&lt;a href="https://github.com/cloudtools/troposphere"&gt;Troposphere&lt;/a&gt; support).&lt;/p&gt;

&lt;p&gt;All three vendors also have a CLI that allows you to provision resources.&lt;/p&gt;

&lt;p&gt;And all three vendors have an API, although AWS has a clear lead in maturity and security for it. And AWS maintains an SDK for nearly every language relevant to this century.&lt;/p&gt;

&lt;p&gt;AWS never creates a service without an API for it. Confluent had an API, but it only recently got to a "mature" state. They have a Go "library", but that's the extent of it.&lt;/p&gt;

&lt;p&gt;I created a CloudFormation resource for Confluent Cloud, to manage my service accounts that way. I also have a Lambda function that performs Confluent SASL credentials rotation.&lt;br&gt;
Both these things led me to create a &lt;a href="https://github.com/compose-x/confluent-cloud-sdk"&gt;Python SDK&lt;/a&gt; to manage Confluent Cloud, which mostly catered to my immediate needs. But its development was slowed down by the state of Confluent's API before it went "GA".&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;We have already gone over logs &amp;amp; audits, so we are going to focus on "metrics".&lt;/p&gt;

&lt;p&gt;Confluent Cloud being very secretive, you cannot access the JVM of the Kafka clusters; sadly, that results in very limited monitoring capabilities. Confluent Cloud does offer a telemetry API that you can use to export data in a &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; format, but the API itself is very heavily rate-limited, so you have to make sure you do not make too many queries.&lt;br&gt;
This further limits some operational abilities, such as getting close-to-real-time metrics like your consumer groups' lag.&lt;/p&gt;

&lt;p&gt;Overall, I found the monitoring capabilities of Confluent Cloud to be too limited, and I had to deploy other services, such as the excellent &lt;a href="https://github.com/seglo/kafka-lag-exporter"&gt;kafka-lag-exporter&lt;/a&gt; to get operationally relevant metrics.&lt;/p&gt;

&lt;p&gt;AWS MSK gathers metrics all around (cluster metrics, consumer metrics, etc.), stored for &lt;strong&gt;free&lt;/strong&gt; in AWS CloudWatch. That allows you to implement alarms using the native tools of the AWS ecosystem, and trigger all sorts of actions (autoscaling your services, alarms, and so on).&lt;/p&gt;

&lt;p&gt;It also supports exporting your JMX metrics in the Prometheus format, allowing you to scrape the brokers/nodes for additional information or export it to another metrics storage system.&lt;/p&gt;
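&lt;p&gt;For reference, enabling that open monitoring is a single property block on the cluster. Below is a sketch of the relevant fragment of an &lt;code&gt;AWS::MSK::Cluster&lt;/code&gt; CloudFormation resource, assuming the rest of the cluster definition (broker node group, Kafka version, etc.) exists elsewhere:&lt;/p&gt;

```yaml
# Fragment of an AWS::MSK::Cluster resource: enable the Prometheus
# JMX and node exporters on the brokers. Other required properties
# (BrokerNodeGroupInfo, KafkaVersion, NumberOfBrokerNodes) are omitted.
OpenMonitoring:
  Prometheus:
    JmxExporter:
      EnabledInBroker: true
    NodeExporter:
      EnabledInBroker: true
```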

&lt;h3&gt;
  
  
  Cluster evolution &amp;amp; operations
&lt;/h3&gt;

&lt;p&gt;The Confluent Cloud offering gives you a level of granularity on the compute size and storage capacity of the cluster. With their "Dedicated" offering, you can choose the number of "Confluent Kafka Units", or CKUs, to match your business needs. But there is a catch: if you want a multi-AZ cluster (redundant within a region), you must use at least 2 CKUs. That brings the cost to a significant amount, whether you need that capacity or not. Combined with the encryption security requirement, this forces you to use their Dedicated offering.&lt;/p&gt;

&lt;p&gt;As it is a managed service, you do not get to perform any operations such as rebalancing partition leaders and so on.&lt;br&gt;
You have to trust that Confluent will be on top of infrastructure issues and perform these operations reliably for you. Also, because it is a managed service and the compute unit obfuscates things for you, you don't get to know how much compute the cluster actually uses. Confluent provides you with an "average load" metric, and that's all.&lt;/p&gt;

&lt;p&gt;Nor can you make any settings changes, such as changing the number of in-sync replica acknowledgements, or, generally speaking, any default or cluster-level settings.&lt;/p&gt;

&lt;p&gt;With AWS MSK, the number of brokers is driven by the number of subnets you deploy your cluster into: the number of brokers must be a multiple of that number. I assume this is to guarantee you get one broker per zone - unless you decided to place all your brokers in subnets using the same zone. You can choose the compute size of your brokers, but be wary that some features are not supported on all broker sizes.&lt;/p&gt;

&lt;p&gt;You can create MSK Configurations that allow you to define cluster level settings, fine-tune these for your use-cases, and associate these with your MSK Cluster.&lt;/p&gt;

&lt;p&gt;In terms of storage, Confluent Cloud advertises an "unlimited" amount of storage, whereas AWS MSK can auto-scale the storage capacity of each broker, from 1GB to 16TB. Both allow adding more brokers, although technically with Confluent, you are changing the number of CKUs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network availability
&lt;/h3&gt;

&lt;p&gt;Both Confluent &amp;amp; AWS MSK allow hosting clusters publicly or privately. But not both at once.&lt;/p&gt;

&lt;p&gt;It is important to note that Confluent Cloud requires extremely large CIDR ranges - at the time of writing - if you are looking at connecting via AWS Transit Gateway or VPC Peering, making integration with existing large legacy IT networks near impossible.&lt;/p&gt;

&lt;p&gt;For AWS users, this leaves you with either VPC Private Link or public access, considering latency and costs (public traffic being 48 times more expensive per GB via a NAT Gateway). Private Link only works one way, so if you were planning to use Confluent-managed connectors, a lot of these are off the table right away.&lt;br&gt;
The way Confluent implemented network ingress on their end will also deny you multipathing: to get from your client to a broker, you must use the endpoint in the same AZ. Any attempt at using an alternative endpoint will be dropped.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ecosystem
&lt;/h2&gt;

&lt;p&gt;Confluent Cloud offers some features only available among clusters hosted by Confluent Cloud (although these are very specific and somewhat limited).&lt;br&gt;
They have KSQL as a service and some connectors. But these are yet again limited in number and/or security options. Not all options supported in the S3 sink connector, for example, are available in Confluent Cloud.&lt;/p&gt;

&lt;p&gt;But for the customers out there not on AWS, Confluent &amp;amp; Aiven can make a very compelling offer.&lt;/p&gt;

&lt;p&gt;AWS MSK integrates natively, thanks to its IAM authentication method, with a lot of other AWS services. The number of services you can integrate with MSK is only going to go up.&lt;/p&gt;

&lt;p&gt;If you wanted KSQL-like capabilities, you can use a service such as Kinesis Data Analytics, which runs managed Apache Flink applications with similar semantics and capabilities to KSQL.&lt;/p&gt;

&lt;p&gt;They both have a managed Schema Registry service which will allow your application teams to store data schemas, which will help tremendously on your data-as-a-product journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;With both Confluent &amp;amp; AWS MSK, you have a model of pay-as-you-go which makes it very easy to get started with and scale as your needs do.&lt;/p&gt;

&lt;p&gt;If you get in touch with the Sales team of Confluent, you might be able to get a discount based on volume and length of contractual engagement, classic IT style.&lt;/p&gt;

&lt;p&gt;It is worth noting that a paid subscription to Confluent Cloud can also get you a license key allowing you to use some of the Confluent services that are under Confluent licensing. Although there is often a truly open-source alternative to the Confluent "purchased" feature, worth considering.&lt;/p&gt;

&lt;p&gt;Technically, you can get a smaller &amp;amp; cheaper MSK cluster with all the bells and whistles for security (encryption, audits, etc.), whereas getting all the options available with Confluent Cloud will raise your costs by quite a factor.&lt;/p&gt;

&lt;p&gt;Because the AWS API &amp;amp; Kafka's API are both so rich, one could imagine implementing further logic, such as binding consumers to partitions whose leader (or replica) is in the same zone as the client, reducing cross-AZ traffic costs. Enabling tiered storage with MSK can also further reduce the storage requirements.&lt;/p&gt;
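&lt;p&gt;One concrete building block for that idea is Kafka's rack-aware fetching (KIP-392): if the brokers are configured with a rack-aware replica selector, a consumer that declares its own zone can fetch from a replica in that zone. A minimal sketch of the client-side settings, with placeholder values:&lt;/p&gt;

```python
# Sketch of consumer settings for rack-aware (follower) fetching, per KIP-392.
# The cluster side must set replica.selector.class to the
# RackAwareReplicaSelector (with MSK, doable through an MSK Configuration).
# Bootstrap address, group id, and AZ id below are placeholders.
consumer_config = {
    "bootstrap.servers": "b-1.demo.kafka.eu-west-1.amazonaws.com:9098",
    "group.id": "orders-consumers",
    # The availability-zone id this consumer runs in: with client.rack set,
    # the broker can serve fetches from a replica in the same zone,
    # avoiding cross-AZ data transfer charges.
    "client.rack": "euw1-az1",
}

print(consumer_config)
```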

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the Kafka world, competition on getting the best offering is fierce, with each vendor contributing to Kafka in their very own way. On different aspects, I sincerely wish for MSK &amp;amp; Confluent, as well as anyone involved in improving the Kafka ecosystem, to work together, progress KIPs along, and not forget the root of Apache Kafka is with the Open Source community. Implementing features that work within their ecosystem is a fair and logical business decision. And so long as the Kafka users come first, choosing your Kafka vendor should only be a question of features that meet your business requirements.&lt;/p&gt;

&lt;p&gt;As a long-term AWS user, I think that MSK is only going to add more and more features that directly serve customers with their operational capabilities, as their features focus is always on the customer &amp;amp; security, first.&lt;/p&gt;

&lt;p&gt;If you are an AWS user today and are heading towards micro-services architectures, where each application has its own set of permissions, using AWS MSK with IAM authentication is a no-brainer and will get you up and running extremely fast.&lt;/p&gt;

&lt;p&gt;In contrast, doing this with Confluent, which has very limited automation around the creation of service accounts, SASL credentials, and operational capabilities, means you will end up creating a few credentials, likely shared among different applications. To stay secure, this requires a lot of discipline and a very good company-wide strategy &amp;amp; maturity.&lt;/p&gt;

&lt;p&gt;With the creation of MSK Serverless, MSK Connect, and the integration with AWS Glue Schema Registry, Kafka becomes part of the wealth of ETL services that AWS offers; that empowers it and gets you into a future-proof position. There is only so much other vendors can do beyond giving you a managed hosted Kafka cluster: you will still have to do everything else yourselves.&lt;/p&gt;

&lt;p&gt;So if you were undecided as you started reading, I hope this post has helped you reach a decision.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kafka</category>
      <category>confluent</category>
    </item>
    <item>
      <title>ECS Anywhere &amp; Traefik Proxy with ECS Compose-X</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Mon, 14 Nov 2022 17:58:14 +0000</pubDate>
      <link>https://forem.com/aws-builders/ecs-anywhere-traefik-proxy-with-ecs-compose-x-2k58</link>
      <guid>https://forem.com/aws-builders/ecs-anywhere-traefik-proxy-with-ecs-compose-x-2k58</guid>
      <description>&lt;p&gt;&lt;a href="https://labs.compose-x.io/apps/traefik_ecs_part2.html"&gt;Original post can be found here&lt;/a&gt; along with &lt;a href="https://github.com/compose-x/compose-x-labs/tree/main/traefik/part_2"&gt;the technical resources&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Using &lt;a href="https://docs.compose-x.io"&gt;ECS Compose-X&lt;/a&gt;, deploy &lt;a href="https://github.com/traefik/traefik"&gt;Traefik Proxy&lt;/a&gt; on-premise with &lt;a href="https://aws.amazon.com/ecs/anywhere/"&gt;AWS ECS Anywhere&lt;/a&gt; with only a few changes from running on AWS EC2 or AWS Fargate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Our tools for today's lab
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.compose-x.io"&gt;ECS Compose-X&lt;/a&gt; is an open-source project that allows you to use docker-compose services definitions, and render CFN templates (just like with AWS CDK, but without having to write code) to deploy your application service stacks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/traefik/traefik"&gt;Traefik Proxy&lt;/a&gt; is an open source project that will allow you to define ingress rules for your applications and will automatically route traffic to your backend services based on various rules. It is also capable of doing Service Discovery, and today we are going to look at the ECS &amp;amp; ECS Anywhere discovery providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/ecs/anywhere/"&gt;AWS ECS Anywhere&lt;/a&gt; is an extension to &lt;a href=""&gt;AWS ECS&lt;/a&gt;, which is a managed container orchestration service, that now allows you to run your workloads in your datacenter/on-premise, and really just, anywhere!&lt;/p&gt;

&lt;h3&gt;
  
  
  The objective
&lt;/h3&gt;

&lt;p&gt;When running on AWS, we have access to services such as AWS Certificate Manager (ACM) and Elastic Load Balancing (ELB, which manages ALB, NLB and more), which can offload a lot of complexity and are very feature-rich.&lt;/p&gt;

&lt;p&gt;However, in on-premise environments, the cost of hardware that would give us the same functionality (think F5 load balancers, or your expensive licensed VXLAN resources) is only affordable for a few. And typically for a "home-labber" such as myself, way out of budget.&lt;/p&gt;

&lt;p&gt;So I needed an alternative that would allow me to use AWS ECS services and route traffic to them based on service discovery. It should also manage SSL certificates for me. And finally, it must be able to deal with non-persistent storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Welcome Traefik Proxy
&lt;/h2&gt;

&lt;p&gt;For years, I have been an NGINX and/or HAProxy user. They are very lightweight and very popular, with great documentation and community support.&lt;/p&gt;

&lt;p&gt;But they aren't quite capable of doing service discovery all by themselves.&lt;/p&gt;

&lt;p&gt;I came across Traefik Proxy, and a whole new world of capabilities opened up. With service discovery providers, Traefik can scrape your services and, based on labels/tags, identify instructions to perform. AWS ECS is one such provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Just a tiny little problem
&lt;/h3&gt;

&lt;p&gt;When I first tried Traefik with ECS Anywhere a little over a year ago, it wouldn't work. That's because, until then, Traefik only considered Fargate or EC2 instances to run the containers. There was no implementation for discovering AWS ECS Anywhere on-prem instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This has been since addressed, and one can specifically enable the ECS Anywhere discovery in Traefik.&lt;/strong&gt;&lt;/p&gt;
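&lt;p&gt;For reference, turning that discovery on is a small piece of Traefik static configuration. The fragment below is a sketch with placeholder cluster and region values, assuming a recent Traefik v2 release:&lt;/p&gt;

```yaml
# Traefik static configuration fragment: ECS provider with
# ECS Anywhere discovery enabled. Cluster name and region are placeholders.
providers:
  ecs:
    clusters:
      - homelab-cluster
    region: eu-west-1
    ecsAnywhere: true
    # Only expose services that explicitly opt in via labels.
    exposedByDefault: false
```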

&lt;h3&gt;
  
  
  Traefik and Let's Encrypt SSL management
&lt;/h3&gt;

&lt;p&gt;When you define routers, you can specify whether or not you want Traefik to provision Let's Encrypt certificates for them.&lt;/p&gt;

&lt;p&gt;With Traefik, you automatically get new certificates when you need them. There are different validation methods, my chosen one being DNS validation.&lt;/p&gt;

&lt;p&gt;For validation, given my DNS domain is managed in Route53, I simply tell Traefik to use that DNS zone for validation.&lt;/p&gt;
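&lt;p&gt;In Traefik's static configuration, that translates to a certificates resolver using the DNS-01 challenge with the Route53 provider. A sketch, with placeholder email and storage path (Traefik picks up the AWS credentials and hosted zone from the environment):&lt;/p&gt;

```yaml
# Traefik static configuration fragment: Let's Encrypt resolver
# using DNS validation through Route53. Email and storage are placeholders.
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@example.com
      storage: /etc/traefik/acme.json
      dnsChallenge:
        provider: route53
```

A router then opts in by referencing the resolver, for example with the service label &lt;code&gt;traefik.http.routers.myapp.tls.certresolver=letsencrypt&lt;/code&gt;.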

&lt;h4&gt;
  
  
  Why DNS validation works for me
&lt;/h4&gt;

&lt;p&gt;If I have internally exposed services (not available on the internet) but still want SSL certificates provisioned for them, DNS is the only option. Beyond that, it generally comes down to your preference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;A local user configured with IAM permissions to deploy resources&lt;/li&gt;
&lt;li&gt;An existing ECS cluster with a registered ECS instance running on-premise&lt;/li&gt;
&lt;li&gt;ECS Compose-X installed (version 0.22 or above). See below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Compose-X install
&lt;/h4&gt;

&lt;p&gt;You can install it locally for your user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pip &lt;span class="nt"&gt;-U&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; ecs-composex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or install it in a Python virtual/isolated environment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv compose-x
&lt;span class="nb"&gt;source &lt;/span&gt;compose-x/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;pip &lt;span class="nt"&gt;-U&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ecs-composex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, run the following command, which ensures you have the necessary settings and resources to get started.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ecs-compose-x init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Clone the labs repo
&lt;/h3&gt;

&lt;p&gt;Clone the repo, and head to the configuration files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/compose-x/compose-x-labs.git
&lt;span class="nb"&gt;cd &lt;/span&gt;traefik/part_2/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the current files, you will have to change the domain name in use.&lt;/p&gt;

&lt;p&gt;You can either edit it with your preferred IDE, or simply run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/bdd-testing.compose-x.io/&amp;lt;your_domain_name.tld&amp;gt;/g'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your domain is not maintained in AWS Route53, you will need to head over to &lt;a href="https://doc.traefik.io/traefik/https/acme/"&gt;the Let's Encrypt ACME documentation&lt;/a&gt; in order to use a different validation method.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting ready to deploy
&lt;/h3&gt;

&lt;p&gt;The deployment to ECS Anywhere is only a command away&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MyExistingECSCLuster ecs-compose-x up &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-n&lt;/span&gt; traefik-anywhere &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-d&lt;/span&gt; templates &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-f&lt;/span&gt; ecs-anywhere.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compose-X renders all of the CFN templates and stores them in your local folder (under &lt;code&gt;templates&lt;/code&gt;) as well as in AWS S3; being in S3 is required for CFN nested stacks.&lt;/p&gt;

&lt;p&gt;After a few minutes, you should have Traefik running on your ECS Anywhere instances.&lt;/p&gt;

&lt;h3&gt;
  
  
Adding SSL certificate backups
&lt;/h3&gt;

&lt;p&gt;The Let's Encrypt "production" endpoint has a rate limit on the number of certificates requested per domain.&lt;/p&gt;

&lt;p&gt;So if you are new to this, we recommend &lt;a href="https://github.com/compose-x/compose-x-labs/blob/main/traefik/part_2/ecs-anywhere.yaml#L36-L37"&gt;using the Let's Encrypt staging environment&lt;/a&gt;, which will keep you from hitting the rate limit.&lt;/p&gt;

&lt;p&gt;Sadly, persistent storage for the file holding the SSL certificates Traefik requests from Let's Encrypt does not seem to be a feature we will see any time soon.&lt;/p&gt;

&lt;p&gt;So instead, we are going to implement the backup and restore ourselves.&lt;/p&gt;

&lt;p&gt;Using two sidecars, one to restore the file before Traefik starts and another constantly watching for changes to automatically back the file up to AWS S3, we ensure that we don't re-request certificates we have already provisioned.&lt;/p&gt;
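
&lt;p&gt;As a rough sketch of the idea (the bucket name and polling interval here are made up; the labs' backup.yaml uses its own images and settings), in compose terms the two sidecars could look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  certs-restore:
    # one-off: pull the last backup before traefik starts
    image: amazon/aws-cli
    command: s3 cp s3://my-certs-backup-bucket/acme.json /certs/acme.json
    volumes:
      - certs:/certs

  certs-backup:
    # long-running: periodically push the file back to S3
    image: amazon/aws-cli
    entrypoint: /bin/sh
    command: -c 'while true; do aws s3 cp /certs/acme.json s3://my-certs-backup-bucket/acme.json; sleep 300; done'
    volumes:
      - certs:/certs

  traefik:
    depends_on:
      certs-restore:
        condition: service_completed_successfully
    volumes:
      - certs:/etc/traefik

volumes:
  certs: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;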

&lt;p&gt;To deploy the solution, we add the &lt;a href="https://github.com/compose-x/compose-x-labs/blob/main/traefik/part_2/backup.yaml"&gt;backup.yaml&lt;/a&gt; file to our deployment command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: the S3 bucket already exists for us; if you want to use an existing one, you will need to adapt the &lt;code&gt;Lookup&lt;/code&gt; &lt;a href="https://github.com/compose-x/compose-x-labs/blob/main/traefik/part_2/backup.yaml#L77-L79"&gt;Tags&lt;/a&gt; in order to point at your own bucket.&lt;/p&gt;

&lt;p&gt;So now, we deploy our updated definition to AWS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MyExistingECSCLuster ecs-compose-x up &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-n&lt;/span&gt; traefik-anywhere &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-d&lt;/span&gt; templates &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-f&lt;/span&gt; ecs-anywhere.yaml
&lt;span class="nt"&gt;-f&lt;/span&gt; backup.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: the order of the files &lt;strong&gt;does matter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that's it! You now have successfully deployed Traefik to ECS Anywhere, with automated backup &amp;amp; restore for your certificates.&lt;/p&gt;

&lt;p&gt;To add additional services you wish Traefik to route to, simply deploy them with the appropriate labels, just like we did in the demo for the &lt;a href="https://github.com/compose-x/compose-x-labs/blob/main/traefik/part_2/ecs-anywhere.yaml#L140-L154"&gt;whoami service&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>traefik</category>
      <category>ecs</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS ECS Anywhere - Is it worth it?</title>
      <dc:creator>John Preston</dc:creator>
      <pubDate>Tue, 25 Oct 2022 13:02:55 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-ecs-anywhere-is-it-worth-it-93k</link>
      <guid>https://forem.com/aws-builders/aws-ecs-anywhere-is-it-worth-it-93k</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;AWS ECS Anywhere is a very powerful extension of AWS ECS that makes a very compelling case for hybrid-cloud, and even multi-cloud, deployments. It is very cost-competitive with alternatives such as Virtual Private Servers (VPS) and other AWS services.&lt;/p&gt;




&lt;p&gt;Recently I spent a lot of time working on ECS Compose-X, &lt;a href="https://github.com/traefik/traefik"&gt;Traefik&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/ecs/anywhere/"&gt;AWS ECS Anywhere&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whilst most of the work I have done with these has been in the context of a home lab, many of the considerations are just as true for large, production-ready deployments in an enterprise context.&lt;/p&gt;

&lt;p&gt;Whilst working on all this, a thought came to mind: what about the costs? Is it worth it to run ECS Anywhere?&lt;/p&gt;

&lt;p&gt;So here are some thoughts and cost considerations to try to come to a decision on this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternatives
&lt;/h2&gt;

&lt;p&gt;When one looks at container orchestration on-premise, there are a lot of different resources, and software to achieve that. &lt;strong&gt;So why would one consider using AWS ECS Anywhere over these?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many blog posts cover running Kubernetes (K8s) on-premise; running a cluster yourself can be very cheap or very expensive, depending on the hardware you run it on and the operational knowledge &amp;amp; expertise required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's stay in the realm of managed services for now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we wanted K8s, AWS EKS Anywhere is an option. But its cost is very high, and that's just for the control plane. Add on top of that the upfront costs for the hardware and infrastructure to run it, and so on. Certainly worth considering for big corporations, but as something to play around with in a home lab, it is more than most of us can afford.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS Anywhere pricing
&lt;/h3&gt;

&lt;p&gt;Looking at the cost of ECS Anywhere: $0.01025 per hour per instance.&lt;/p&gt;

&lt;p&gt;This means $0.01025 &amp;times; 744 hours = $7.626 a month for each registered on-premise instance that is running.&lt;/p&gt;

&lt;p&gt;I say running, because if your ECS instance is not marked as running, it does not cost you anything. This can be very handy if you automate turning VMs/hardware on and off at home when you need more compute for processing. In addition to my Pi 4, I have an old Intel NUC, which I only turn on to test x86-based applications that don't have ARM support, so I barely pay anything for it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;On the plus side, neither ECS nor EKS Anywhere has a cost based on the number of containers you run, or the hardware running these (i.e. licensing per vCPU/RAM).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So with my humble Raspberry Pi 4 and 4GB of RAM, I can very comfortably run several services before getting into&lt;br&gt;
hardware limits (it depends, of course, on the workload).&lt;/p&gt;

&lt;p&gt;Using CPU/RAM limits and reservations, one can ensure that applications get priority and have enough capacity to run, and we recommend that you set these wherever possible.&lt;/p&gt;
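
&lt;p&gt;In compose syntax (the values here are purely illustrative), such limits and reservations are declared per service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  app:
    image: my-app:latest
    deploy:
      resources:
        limits:
          cpus: "0.50"   # hard cap
          memory: 256M
        reservations:
          cpus: "0.25"   # guaranteed share
          memory: 128M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;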

&lt;p&gt;If you want to have enhanced features and monitoring of the on-premise instance as well, you can enable that with SSM, at an additional cost, &lt;strong&gt;but that is optional&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's also compare it with VPS services. Of course, my Pi 4 does not exactly match either the VM or container profiles.&lt;/p&gt;

&lt;p&gt;So let's take a look at the closest options with AWS LightSail (not accounting for local storage or bandwidth, that's up to the infra/setup, more on that later):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VM, 2 vCPUs, 4GB of RAM -&amp;gt; $20 a month&lt;/li&gt;
&lt;li&gt;VM, 4 vCPUs, 16GB of RAM -&amp;gt; $80 a month&lt;/li&gt;
&lt;li&gt;Containers, 2 vCPUs, 4GB of RAM -&amp;gt; $80 a month&lt;/li&gt;
&lt;li&gt;Containers, 4 vCPUs, 8GB of RAM -&amp;gt; $160 a month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With DigitalOcean Droplets, the closest in CPU/RAM would be $24 or $48 a month. Again, this is without considering bandwidth and local storage. Other vendors have prices that vary depending on whether you pay annually or monthly, so the comparison sheets quickly become a mess.&lt;/p&gt;

&lt;p&gt;Now, for fairness, let's add the electricity costs for the Pi 4: 5V &amp;times; 3A = 15W =&amp;gt; $4 to $5 a month (varies with your electricity rates).&lt;/p&gt;

&lt;p&gt;If you also add the costs of electricity for your router/switch, your ISP subscription and so on, it can accumulate to an expensive bill altogether.&lt;/p&gt;

&lt;p&gt;But you weren't planning on giving up Netflix, browsing, or working from home, were you?&lt;br&gt;
So in a way, these additional costs are "what they are", and you would pay them whether you run ECS Anywhere or not, &lt;em&gt;in the context of a home lab&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the context of a data center, however, the cost of rack rental, power supply, and connectivity is usually already factored in when you buy the racking space.&lt;/p&gt;

&lt;p&gt;So I am only going to factor in the cost of the Pi at its peak power usage. In total, I estimate about $14 a month for an unlimited number of services (within the boundaries of available capacity). And you get all of the IAM benefits from a security point of view.&lt;/p&gt;

&lt;p&gt;It is important to note, for fairness, that I am fortunate to have a symmetrical 1Gbps fiber connection with unlimited data, so bandwidth is not a problem for me. But if you needed more bandwidth, or features like multiple IP addresses, you might have to go with a VPS provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;So altogether, ECS &amp;amp; ECS Anywhere make for a very appealing option as an in-cloud and on-premise container orchestrator and managed service. They come with a rich set of capabilities, from logging to monitoring, scalability, and observability.&lt;/p&gt;

&lt;p&gt;Even more important, from an operational point of view: the configuration of ECS task definitions and services is nearly identical in both cases; only a few settings change.&lt;/p&gt;

&lt;p&gt;Some features, however, such as service discovery using &lt;a href="https://aws.amazon.com/cloud-map/"&gt;AWS CloudMap&lt;/a&gt;, are not available. This is where we rely on AWS' APIs for discovery, the way Traefik and many similar services do.&lt;/p&gt;

&lt;p&gt;I used to run some small services in LightSail, and although it made some management aspects easier, I am delighted to run services with ECS Anywhere at a lower cost. It also forces me to think about running applications in a less-than-perfect environment (hardware-wise) compared to AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>costanalysis</category>
      <category>awsecs</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
