Forem: Michael Ade-Kunle

Publish-Subscribe: Introduction to Scalable Messaging

Michael Ade-Kunle — Mon, 07 Sep 2020 10:26:10 +0000

The publish-subscribe (or pub/sub) messaging pattern is a design pattern for exchanging messages between senders (publishers) and receivers (subscribers). Subscribers receive only the messages published on the topics they subscribe to, which keeps the two sides loosely coupled and lets each scale independently.

Messages are sent (pushed) from a publisher to subscribers as they become available. The host (publisher) publishes messages (events) to channels (topics). Subscribers can sign up for the topics they are interested in.

This is different from the standard request/response (pull) model, in which clients must repeatedly check whether new data has become available. This makes pub/sub well suited to streaming data in real-time.

It also means that dynamic networks can be built at internet scale. However, building a messaging infrastructure at such a scale can be problematic.

This introduction to the pub/sub messaging pattern describes what it is and why developers use it, and discusses the difficulties that must be overcome when building a messaging system at scale.

The Ably realtime platform uses the publish-subscribe pattern at internet scale for delivering messages in real-time.

What Is Pub/Sub? Loose Coupling and Scaling

In the pub/sub messaging pattern, publishers do not send messages directly to all subscribers; instead, messages are sent via brokers. Publishers do not know who the subscribers are or to which (if any) topics they subscribe. This means publisher and subscriber operations can operate independently of each other. This is known as loose coupling and removes service dependencies that would otherwise be there in traditional messaging patterns.
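To make the broker's role concrete, here is a minimal in-memory sketch. It is an illustration only, not how any particular platform implements brokering: the `Broker` class, its method names, and the topic names are all hypothetical. It shows that the publisher hands a message to the broker and never holds a reference to any subscriber.

```javascript
// Minimal in-memory broker (hypothetical): topic name -> set of subscriber callbacks.
class Broker {
  constructor() {
    this.topics = new Map();
  }

  // Register a callback for a topic; returns a function that unsubscribes it.
  subscribe(topic, callback) {
    if (!this.topics.has(topic)) {
      this.topics.set(topic, new Set());
    }
    const subscribers = this.topics.get(topic);
    subscribers.add(callback);
    return () => subscribers.delete(callback);
  }

  // Deliver a message to every current subscriber of the topic.
  // The publisher learns only how many deliveries happened, not who received them.
  publish(topic, message) {
    const subscribers = this.topics.get(topic);
    if (!subscribers) return 0;
    for (const callback of subscribers) {
      callback(message);
    }
    return subscribers.size;
  }
}

const broker = new Broker();
const scoreboard = [];
const newsTicker = [];

broker.subscribe("scores", (msg) => scoreboard.push(msg));
const stopTicker = broker.subscribe("scores", (msg) => newsTicker.push(msg));

const delivered = broker.publish("scores", "15-0");  // both subscribers receive it
stopTicker();                                        // the ticker leaves; the publisher is unaffected
broker.publish("scores", "30-0");
const unheard = broker.publish("weather", "sunny");  // no subscribers: nothing breaks
```

Because subscribers come and go through the broker alone, the publishing code above never changes when the audience does.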

Pub/sub is different from the standard request/response model, in which clients pull, repeatedly checking whether new data is available. This makes pub/sub central to streaming data effectively in real-time.

The pub/sub pattern allows extremely dynamic networks to be built at scale without overloading the publishing components or causing unnecessary costs. However, there are difficulties associated with scaling and different ways of getting around these difficulties that need consideration.

Typical uses of the pub/sub pattern include event messaging, instant messaging, and data streaming (such as live-streaming sporting events). Pub/sub is also used for workload balancing and with asynchronous workflows.

Communication infrastructure for a pub/sub system (Diagram adapted from msn).

A Background to Messaging Systems and Pub/Sub

A simple information system can follow a simple pattern: input–processing–output. At a reasonable scale, the system will need multiple input and output modules for handling concurrent requests. A problem then arises of routing messages from input modules to their respective output modules.

To solve this routing problem, the input and output modules need an addressing mechanism. The processing module is then responsible for routing each message to the correct recipient based on its address.

At internet scale, the publish-subscribe pattern can handle tens of thousands of concurrent connections.

At internet scale, the system will handle thousands or even tens of thousands of concurrent connections. It also needs to handle high volumes and users spread across the globe.

At such a massive scale, the system needs to solve the following problems:

Because of the high volume and geographical spread, the load needs to be distributed between multiple processing modules.
Predefined addressing between the modules becomes a huge overhead.

In short, the problems come down to minimizing the shared knowledge of addresses. Pub/sub solves the problems by using a data pipe through which modules can post and retrieve their messages.

The modules do not need to maintain shared knowledge of the whereabouts of other modules. The input modules only accept user input, processing modules only process the data, and the output modules only display the output.

In pub/sub, there is one channel for posting messages and one for retrieving. It happens in steps like this:

The input module will gather the user input and post the message in the preprocessing channel.
The processing module will pick up each message from this channel, process it, and post it to the post-processing channel.
Lastly, the output module will collect the message from the post-processing channel and display it on the users’ screens.
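The three steps above can be sketched with two named channels. This is a simplified, synchronous illustration: the `subscribe` and `publish` helpers are hypothetical, and upper-casing the text merely stands in for whatever real processing a system would do. Note that no module refers to another module directly, only to a channel name.

```javascript
// Tiny channel registry (hypothetical): channel name -> list of handler functions.
const channels = new Map();

function subscribe(channel, handler) {
  if (!channels.has(channel)) channels.set(channel, []);
  channels.get(channel).push(handler);
}

function publish(channel, message) {
  for (const handler of channels.get(channel) || []) {
    handler(message);
  }
}

// Output module: only displays what arrives on the post-processing channel.
const screen = [];
subscribe("post-processing", (message) => screen.push(message.text));

// Processing module: consumes from preprocessing, publishes to post-processing.
// Upper-casing is a placeholder for real processing.
subscribe("preprocessing", (message) => {
  publish("post-processing", { ...message, text: message.text.toUpperCase() });
});

// Input module: only gathers user input and posts it.
function onUserInput(text) {
  publish("preprocessing", { text });
}

onUserInput("hello");
onUserInput("pub/sub");
```

Adding a second processing module or a second output module would only require another `subscribe` call; the input module would not need to change.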

The same pattern works at any scale.

In pub/sub messaging pre- and post-processing of the messages is used to address routing problems at internet scale.

Why Developers Use Pub/Sub

Consider a hypothetical logistics company. It would typically have a mix of customer data and generic data, and a highly variable customer load. The data channels between the customers, the drivers, and the delivery office may also be unreliable. It is important that subscribers receive all of the messages customers are sending, but it is not necessary to know about the customers or how many there are.

It is also important that the company does not over-provision its service (which would be costly), or over-provision load balancing, which would add extra complexity and be detrimental to the performance of the network.

It is important to remember that the pub/sub pattern is suited to conveying information whose relevance fades fast. (What is the score now? And now?) As information is frequently replaced, there is no pressing need to store it. Usually, it is enough to keep the most recent message, or enough information to recreate a view of fairly recent events.
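As a rough illustration of keeping only the most recent message, the sketch below retains one value per topic and replays it to anyone who subscribes late. The `LatestValueBroker` class and the topic name are hypothetical and exist only for this example.

```javascript
// Broker that remembers only the latest message per topic (hypothetical sketch).
class LatestValueBroker {
  constructor() {
    this.subscribers = new Map(); // topic -> array of callbacks
    this.latest = new Map();      // topic -> most recent message
  }

  publish(topic, message) {
    this.latest.set(topic, message); // overwrite: older messages are not kept
    for (const callback of this.subscribers.get(topic) || []) {
      callback(message);
    }
  }

  // New subscribers immediately receive the latest message, if there is one.
  subscribe(topic, callback) {
    if (!this.subscribers.has(topic)) this.subscribers.set(topic, []);
    this.subscribers.get(topic).push(callback);
    if (this.latest.has(topic)) callback(this.latest.get(topic));
  }
}

const scores = new LatestValueBroker();
scores.publish("match-1", "15-0");
scores.publish("match-1", "30-0");

const lateVisitor = [];
scores.subscribe("match-1", (score) => lateVisitor.push(score)); // receives "30-0" at once
scores.publish("match-1", "40-0");
```

The late subscriber never sees "15-0", which is acceptable precisely because the relevance of that score has already faded.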

Developers use pub/sub to take advantage of edge computing and the network backbone:

Edge computing allows you to scale the system at the edge. This is where scaling is easier to implement and also where it is most cost-effective.
Using the network backbone and multiple points of presence means message delivery can be much faster and more reliable.

How Pub/Sub Is Adopted in the Real-World

Event messaging: pub/sub powers many realtime interactions across domains like EdTech, B2B platforms, and delivery logistics. As we shop online more frequently for a wider variety of goods, package delivery has become commonplace. Logistics companies need to use delivery resources more efficiently. To optimize delivery, dispatching systems need up-to-date information on where their drivers are. Pub/sub event messaging helps logistics companies do this.

Dispatchers need to access drivers’ location information on demand, ideally continually. Having this data at the ready allows them to better predict arrival times and improve routing solutions. Dispatching systems also send out information such as cancellations, traffic information, and new package pickups.

As the day goes on, this information becomes more critical. It gets harder to maintain delivery time windows, and schedule adjustments must be made to maximize the number of on-time deliveries.

This is a lot of data, and not all of it is relevant at any given time. To get around this problem, devices need to be able to subscribe to updates that matter to them. With a pattern like pub/sub, all parties only subscribe to whatever is relevant to them:

Driver devices can subscribe to traffic and route information.
Dispatching and ERP systems can subscribe to the completed delivery updates.
Tracking and dispatching systems can get live position updates when they need them.

These systems enable customers to track deliveries in real-time and, for example, to reschedule a package in transit. They can also alert drivers to pickups to be made en route, which allows for more effective routing, reduces fuel costs, and improves efficiency.

Other use-case examples include:

Instant messaging: Service that provides near-instantaneous interaction, for example, a notification that the person you’re conversing with is typing.
Data streaming: Applications can provide data instantly to clients for processing, saving or live preview. For example, providing the latest match scores in a tennis tournament and making sure they are available to a new website visitor the moment the page loads. See the Tennis Australia case study.
Workload balancing: Knowing capacity and location of parts of a system allows for better utilization of effort. This includes, for example, allowing logistics dispatchers to use partly empty delivery vehicles for pickup and on-demand delivery.
Asynchronous workflows: As an example, think of factory machines and power, water, and other utility sensors updating central control systems live. Improving the efficiency of the supply chain allows for just-in-time manufacturing, and capacity control.

Pub/sub code examples

Here are two examples of pub/sub applications with code snippets.

Faye

Faye is an open source system used by Aha! Roadmap software and Shopify. It is based on pub/sub messaging. The following code sample shows how to start a server, create a client, and send messages:

Ably Realtime Chat App

Here is an example of how you might add pub/sub functionality to a chat app using one of Ably’s Realtime SDKs.

When the app launches, the SDK initializes and subscribes to the topic that represents a public chat room.

Subsequently, when the user wants to send a chat message, the chat app publishes the message on the same topic.

The app unsubscribes from the channel when the user logs out or leaves the chat room.

What to Consider When Pub/Sub Is Deployed and Scaled

It is straightforward to implement a single-channel pub/sub messaging framework. But when you start to scale, the classic problems of distributed systems engineering emerge. When scaling to multiple channels and increasing complexity to any significant degree, the problems increase, and maintaining reliability becomes difficult.

The Problems of Building a Messaging System at Scale

Distributed messaging systems should ideally have the three elements of reliability, speed, and ordering. However, it’s usually the case that you only get to have two of them. To create a system that allows all three, you have to start at the design level with a watertight mathematical model. It is just about impossible to add in the missing third element later.

These are the problems to deal with:

Ordering of messages. As you start distributing messages over a large network, problems arise with reliably reconstructing the order in which the messages are meant to be delivered. To be reliably fast, you have to send messages using multiple routes in parallel, but you also have to be able to re-order and maintain their original sequence.
Queuing and auto-persistence of messages. For fault-tolerant, reliable messaging you must build in auto-persistence — otherwise reconstruction is impossible if a system goes down and the records vanish. If you don’t queue messages you can’t reliably reconstruct an order, or handle fluctuations in bandwidth.
Send exactly once. To send a message once and for it to be received only once at its required destination is a classic problem. If you don’t know who is receiving the message, it has to go everywhere. You have to have logic either in the network to stop it from arriving twice; or in the application to stop it from being processed twice. Otherwise, you might trigger an event twice with unintended consequences. For example, while making an online payment a user is disconnected and quickly reconnects. If exactly-once semantics are not supported, the user can end up getting charged a second time when they reconnect.
Distributed storage. Fault tolerance requires multiple points of redundancy, failover storage, storage in different physical locations, and auto-healing networks. True reliability requires redundant physical hardware along with multiple cloud instances. The trade-off with such redundancy is complexity vs. security and safety.
Load surge and slowdown. A very transient load has to be scaled dynamically, with quick scale-up and slower scale-down, to maintain a fair and available network for users.
Rate limitation. Fair workload balancing is complicated. When your system becomes complex you need to consider how to manage customer usage. You have to provision service capacity for different customers fairly, without imposing hard limits.
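Two of the problems above, ordering and exactly-once processing, can be handled in the application when the network does not guarantee them. The sketch below is one possible approach under stated assumptions: every message carries a unique `id` and a gap-free `seq` number starting at 1. Those field names and the `OrderedReceiver` class are hypothetical.

```javascript
// Receiver that tolerates duplicates and out-of-order arrival.
// Assumes every message carries a unique `id` and a gap-free `seq` starting at 1.
class OrderedReceiver {
  constructor(deliver) {
    this.deliver = deliver;   // called once per message, in seq order
    this.nextSeq = 1;         // next sequence number allowed to be delivered
    this.seenIds = new Set(); // ids already accepted, used for deduplication
    this.pending = new Map(); // seq -> message that arrived early
  }

  receive(message) {
    if (this.seenIds.has(message.id)) return; // duplicate: drop silently
    this.seenIds.add(message.id);
    this.pending.set(message.seq, message);

    // Flush every message that is now contiguous with what was already delivered.
    while (this.pending.has(this.nextSeq)) {
      this.deliver(this.pending.get(this.nextSeq));
      this.pending.delete(this.nextSeq);
      this.nextSeq += 1;
    }
  }
}

const processed = [];
const receiver = new OrderedReceiver((m) => processed.push(m.body));

receiver.receive({ id: "b", seq: 2, body: "charge card" });  // arrives early: held back
receiver.receive({ id: "a", seq: 1, body: "create order" }); // releases seq 1, then seq 2
receiver.receive({ id: "b", seq: 2, body: "charge card" });  // retry after reconnect: ignored
receiver.receive({ id: "c", seq: 3, body: "send receipt" });
```

In this sketch the set of seen ids grows without bound and lives in memory, so a real system would also need persistence and some form of expiry, which is where the queuing and storage concerns listed above come in.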

These are all problems of building a system at scale. Because you don’t necessarily know all the information you might need about your system at any given time, either the framework needs to be clever enough to handle it, or all the applications in your system need to be quite advanced.

Ably balances the above concerns through judicious use of the TCP layer. By generating multiple paths, we gain reliability without sacrificing speed, and we can do fast pathing because we control the path we follow. Also, because of the way the network is set up, we can maintain ordering, which is often lost in the trade-off with speed of delivery.

This is baked in at the design stage, because the problems that arise when building in a global framework are almost impossible to correct at a later stage.

SaaS or Self-Deploy?

You can either build a pub/sub messaging infrastructure yourself (self-deploy) or adopt a cloud native Software-as-a-Service (SaaS) infrastructure, such as Ably.

Solving the design considerations of building a globally scaling system is far from easy for reasons described in the previous section. Building your own messaging system requires budgeting for more design upfront.

If choosing to self-deploy, there are also considerations such as infrastructure setup, installing, and framework configuration. Doing these yourself gives you oversight of building the features you want in your system, but is also time-consuming and expensive.

The advantages of “as-a-service” pub/sub infrastructure over self-deployment are:

Reduced development time. Pub/Sub isolates application development from the messaging infrastructure.
Managed infrastructure is preconfigured. System tuning, security and design considerations are costly and time-consuming.
Programming options. Managed services support popular programming languages and frameworks. On the other hand, message broker frameworks support only a few languages. Building and maintaining SDKs for your own message broker is a diversion of development effort and time.
Skills. Hiring distributed systems engineers is difficult. If putting together a systems engineering team becomes part of your core infrastructure, you then have to maintain their skill set.
Cost. Most SaaS business models offer controllable levels of expenditure. You pay according to your needs and usage. Although it might seem cheaper to self-deploy, this hides the amount of investment required to build, run, and maintain the software. Your cloud bills are not the only expense.

Publish-Subscribe at Ably

Ably is an enterprise-ready pub/sub messaging platform. We make it easy to efficiently design, quickly ship, and seamlessly scale critical realtime functionality delivered directly to end-users. Every day we deliver billions of realtime messages to millions of users for thousands of companies.

We power the apps that people, organizations, and enterprises depend on every day, like Lightspeed System’s realtime device management platform for over seven million school-owned devices, Vitac’s live captioning for hundreds of millions of multilingual viewers for events like the Olympic Games, and Split’s realtime feature flagging for one trillion feature flags per month.

We’re the only pub/sub platform with a suite of baked-in services to build complete realtime functionality: presence shows a driver’s live GPS location on a home-delivery app, history instantly loads the most recent score when opening a sports app, stream resume automatically handles reconnection when swapping networks, and our integrations extend Ably into third-party clouds and systems like AWS Kinesis and RabbitMQ. With 25+ SDKs we target every major platform across web, mobile, and IoT.

Our platform is mathematically modeled around Four Pillars of Dependability so we’re able to ensure messages don’t get lost while still being delivered at low latency over a secure, reliable, and highly available global edge network.

Developers from startups to industrial giants choose to build on Ably because we simplify engineering, minimize DevOps overhead, and increase development velocity.

Alternative Further Reading

Now that you know the basics of pub/sub, read more about messaging design patterns and realtime technologies in general. Or jump in and try sending and receiving some messages with the Ably platform.

Ably secures $7M to set a new standard in realtime edge messaging

Michael Ade-Kunle — Wed, 08 Apr 2020 14:01:18 +0000

Ably has closed a $7M Series A funding round led by MMC Ventures, with our seed funders Forward Partners co-investing. This funding enables us to deliver on our mission to provide the new standard of realtime edge messaging infrastructure required for a connected future.

Consumers today, more so than ever, expect their digital experiences to be realtime, yet as you know it can be incredibly difficult to deliver that. Ably solves this problem by providing simple APIs that developers can depend on, so that they can get on with building their apps and services.

This funding comes as a result of huge milestones for Ably:

We’ve streamed 1.5 trillion realtime messages over the Ably network
100 billion realtime messages now streamed to 50 million end-users each month
150% year-on-year growth since 2017
Some of the largest companies in the world depend on Ably to deliver mission-critical realtime experiences around the world

We’re using this funding to scale our team — we’ve already quadrupled in size since 2019 — so that we can focus on three things:

Invest in our core platform so we can build more features that we know developers need, such as the upcoming Data Deltas and the recent rewind capability on channels.
Further integrate into other event-driven platforms to help developers overcome the fragmented ecosystem we’re all operating in. Check out our recent Zapier, IFTTT, and Cloudflare Workers integrations if you haven’t already.
Explore and support more open protocols so developers can choose the right tool for the job. Right now we support Server-Sent Events (SSE), MQTT, STOMP, AMQP, and proprietary protocols of other realtime providers.

Ably has come a long way since 2013 when Matthew O’Riordan, Ably’s CEO, and Paddy Byers, Ably’s CTO, were unable to find a realtime solution that provided adequate guarantees for performance without compromising on data integrity or reliability. They launched Ably in 2016 after three years of intense research and development, investing over 50,000 hours into building the foundation of Ably’s complete realtime edge messaging platform.

Our focus from day one was architecting the Ably platform around Four Pillars of Dependability:

Performance
Reliability
Availability
Integrity (of data)

Today, Ably is still the only realtime edge messaging platform that is able to operate within strict, dependable, transparent boundaries, at any scale. It’s why we’re honoured to see so many engineers from tech startups to large enterprises, across numerous industries such as SaaS, mobility, sport, eCommerce and eLearning, to name a few, trust us for their mission-critical realtime needs.

Everyone at Ably is incredibly grateful for all of our customers who’ve supported us and we feel privileged to work with you, especially those that have trusted us as we’ve grown.

If you want to be part of Ably’s next stage and empower developers everywhere to build next-generation experiences, check out our careers page :).

Originally published at https://www.ably.io/blog/series-a on 06 Apr 2020.

Ably adds native integrations for Zapier, IFTTT, and Cloudflare Workers

Michael Ade-Kunle — Tue, 17 Mar 2020 10:50:48 +0000

A common requirement in realtime messaging applications is for developers to be able to insert some business logic into a message processing pipeline. Typical use-cases might be to perform filtering or payload transformation on a message-by-message basis, either when first ingested into the messaging service, or as part of a rule that captures messages from one channel, applies the business logic, and then forwards the message to another channel.

Ably supports these use-cases through Reactor Integration rules that invoke cloud functions (e.g. AWS Lambdas or Google Cloud Functions). By providing a gateway to cloud functions from cloud services providers, we believe we provide the best available mechanism for these use-cases while making it easier to build with the ecosystems you’re already operating in.

That’s why, as of today, Ably integrates with three new services, allowing you to act on realtime events in the Ably network by triggering actions and executing business logic across Zapier, IFTTT, and Cloudflare Workers. In the case of Zapier, you can also publish to an Ably channel as part of a Zap workflow.

Why these three services?

Ably’s developer community already uses Reactor Integrations for a bunch of useful things like creating a profanity filter with AMQP and Neutrino or triggering serverless functions in AWS Lambda. With Zapier, IFTTT, and Cloudflare Workers even more things become possible.

Zapier is a natural integration for us to support as so many developers, including us at Ably, already have sophisticated workflows set up on the platform to connect and automate over 1,300 disparate business services.

IFTTT is the perfect example of how event-driven thinking is permeating both the business and personal digital worlds: it helps connect and automate apps and devices, with a focus on personal smart home devices and mobile apps.

And Cloudflare Workers is Cloudflare’s answer to serverless functions. Workers lets you build apps by deploying serverless code to data centers across 200 cities in 90 countries, with no need to think about regions. We’re already using Cloudflare Workers at Ably and we’re excited to partner with Cloudflare on this.

If you’re familiar with Reactor Integrations and want to jump right in, check out the following links. If you need a bit of background then read on.

Cloudflare documentation and tutorial to build a realtime browser-based game
Zapier documentation, tutorial showcasing Zapier's IoT capabilities, and webinar slides
IFTTT documentation and tutorial delivering text-based commands for simulations
First official Ably Zapier integration ⚡️
Reactor section of your Ably dashboard to set up new integrations

How these new integrations work

All three integrations essentially use webhooks to communicate with these services. You can set up a Reactor Rule in your app dashboard to control exactly how and what you wish to communicate with your endpoint. This can vary from the data you want to send (message or presence events), which of your channels to send from, and which endpoint to send to. Once that’s done, Ably handles the logic, execution, and delivery.

Our documentation goes into specifics about rule fields, enveloping, and batching, and provides examples, so it’s best to familiarize yourself with the relevant sections.

Bidirectional triggers, kind of

Reactor Integrations have typically been one-way: an event in Ably triggers an action in another system. But with Ably’s first official Zapier integration an event can now trigger an action in Ably (publishing a message to a channel) as part of a Zapier workflow.

To add Ably to a Zap, check out our step-by-step instructions.

IFTTT has known limitations

The IFTTT integration is fully functional but the capabilities are more limited than other integrations. This is down to how IFTTT accepts HTTP requests.

Usually, Reactor Integrations take any data published on an Ably channel and allow it to be forwarded to an endpoint, stream, or other system. This is because Ably expects publishers (e.g. an IoT device) to publish payloads on channels in a way that makes sense to them, and the subscriber (e.g. a Reactor Integration) to be able to parse that data and use it.

Unfortunately, IFTTT expects and accepts only a specific format of JSON. This means the publisher and subscriber are very tightly coupled, limiting the capabilities of this integration. Please read the IFTTT documentation for more info on this.

If you do set up an IFTTT Reactor Integration and face problems, please get in touch and we’ll do what we can to help overcome IFTTT’s current limitations.

How Reactor Integrations work

If you're not too familiar with how Ably’s Reactor Integrations work, here's a short overview. There are three integration types:

Events trigger actions in other systems based on realtime events that occur in Ably. They’re designed for low to medium volumes of data and encompass Webhooks, serverless functions, and now Zapier, IFTTT, and Cloudflare Workers.
Queues are our AMQP/STOMP queueing services. Ably hosts these. They help our users to consume high-frequency realtime messages that need to be further processed and transformed. We’ve written about how queues relate to pub/sub and overcoming difficulties of FIFO in distributed systems.
Firehose provides enterprise-only integrations into streaming and queueing systems our customers are already using such as Amazon Kinesis, Apache Pulsar, RabbitMQ, and AWS SQS.

For each integration there are certain events that can trigger a Reactor Integration rule to execute:

Messages trigger function calls as soon as they’re published on a channel.
Presence events trigger function calls when clients enter, update their data, or leave channels.
Channel lifecycle events (batched messages only for now). When a channel is created (following the first client attaching to this channel) or discarded (when there are no more clients attached to the channel), the lifecycle event is streamed.

We’re investing in more integrations

At Ably we strive to make the complex simple. We released our first ever integrations to remove the frustration and complexity that comes with building, maintaining, and scaling multiple services and environments. We’ve supported Webhooks along with native integrations with services like AWS Lambda, Azure Functions, and RabbitMQ for a while.

Adding additional integrations gives our customers, who increasingly operate across a fragmented landscape of platforms and services, the flexibility to use the best-in-class services and compute they’re already working with while Ably handles the complexity and scale of doing so. We’re working on more Reactor Integrations, which will be coming throughout 2020 so keep an eye out.

This release brings the total number of native Ably Reactor Integrations up to twelve. When combined with Webhook support this means we’re able to provide a gateway to countless discrete pieces of business logic and code, some of which you might already be running.

And if you do something cool with these new integrations, be sure to let our Dev Rel team know - they’re always looking for ways to promote members of our developer community.

Ably Masterclass | Episode 2 - Building an IoT based realtime attendance system for Slack

Michael Ade-Kunle — Tue, 10 Mar 2020 10:42:48 +0000

Hello again realtime tech aficionados 👋🏽

I just concluded the second episode of the masterclass series where I talked about building an RFID based access card scanner that can publish messages directly into a Slack channel. In this article, I’ll summarise what happened and link to all the essential resources you may need to build this yourself or learn more about the various components involved.

You can see the recorded video of the actual session on our YouTube channel:

The slides presented during this masterclass episode are available online and are very much self-explanatory. I’ll of course explain the rest in this article.

I started off the masterclass with a brief explanation of all the concepts involved, including MQTT, Webhooks, IoT components, etc. I later moved on to explaining the code to build this app.

A detailed written tutorial to build this out is already available on our tutorials site.

At the end of this masterclass, I demonstrated the full application: I picked up an access card, put it near the RFID reader, and a Slack message magically came through, as shown below:

Btw, I secretly announced a link to join our brand new dev community (which for now is the Slack org that you see in the screenshot above), which is still being properly set up. But if you’d like to get early access to this community, feel free to join in. 🤓

Useful resources from this masterclass episode

We’ll announce the next episode in this masterclass series very soon. Stay tuned! Ciao for now 👋🏽

In other very unrelated news, here’s a super handy tip that I recently learned myself: press ctrl + cmd + space on your Mac to launch the emoji keyboard 🎉🦄🌚🕵🏽‍♀️ Thank me later!

Everything You Need To Know About Socket.IO

Michael Ade-Kunle — Wed, 26 Feb 2020 09:57:46 +0000

This article explores Socket.IO, its main use cases and how to get started. We also help identify ideal use cases for Socket.IO, including signs your app has scaled beyond Socket.IO’s scope for support. This article examines where Socket.IO fits into the realtime landscape today, looking into competing technologies/packages, and what the future looks like for the library.

What is Socket.IO?

Socket.IO was created in 2010. It was developed to use open connections to facilitate realtime communication, still a relatively new phenomenon at the time.

Socket.IO allows bi-directional communication between client and server. Bi-directional communications are enabled when a client has Socket.IO in the browser, and a server has also integrated the Socket.IO package. While data can be sent in a number of forms, JSON is the simplest.

To establish the connection, and to exchange data between client and server, Socket.IO uses Engine.IO. This is a lower-level implementation used under the hood. Engine.IO is used for the server implementation and Engine.IO-client is used for the client.

How Socket.IO works

Socket.IO brings to mind WebSockets. WebSockets are also a browser implementation allowing bi-directional communication. However, Socket.IO does not use WebSockets as standard. First, Socket.IO creates a long-polling connection using xhr-polling. Then, once this is established, it upgrades to the best connection method available. In most cases, this will result in a WebSocket connection. See how WebSockets fare against long-polling (and why WebSockets are nearly always the better choice) on the Ably blog, where a full overview of WebSockets, their history, how they work, and their use cases is also available to read.

Socket.IO – In action

A popular way to demonstrate the two-way communication Socket.IO provides is a basic chat app (we talk about some other use cases below). With sockets, when the server receives a new message it will send it to the client and notify them, bypassing the need to send requests between client and server. A simple chat application shows how this works.

Example – Socket.IO for chat

Server

You will need to have Node.js installed. We will be using Express to simplify setup.

Create a new folder with:

Set up the server and import the required packages.

The server root will send our index.html, which we will set up shortly.

Here is where we set up Socket.IO. It listens for a ‘connection’ event and runs the provided function any time this happens.

This will set up the server to listen on port 3000.

Run the application with node index.js and open the page in your browser.

Client

Include the following scripts on your page, before the closing "body" tag. You now have a socket connection set up.

This is the minimum setup to get the Socket.IO connection working. Let’s go a bit further to get messages sent back and forth.

Server

Inside the function we are using io.emit() to send a message to all the connected clients. This code will notify them when a user connects to the server.

If you want to broadcast to everyone except the person who connected you can use socket.broadcast.emit().

We will also add a listener for any new messages received from a client and send a message to all users in response.
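The server-side steps above can be sketched as a single function. This is a minimal illustration, not the article's original listing: it assumes `io` is a Socket.IO server instance created elsewhere (for example with `new Server(httpServer)` from the `socket.io` package), and the event names `user connected` and `chat message` as well as the function name `registerChatHandlers` are made up for the example.

```javascript
// Attaches the chat handlers to a Socket.IO server instance.
// `io` is assumed to be created elsewhere; event names are illustrative.
function registerChatHandlers(io) {
  io.on("connection", (socket) => {
    // Notify every connected client, including the one that just joined.
    io.emit("user connected", "A user has joined the chat");

    // Alternative: notify everyone except the client that just connected.
    // socket.broadcast.emit("user connected", "A user has joined the chat");

    // Relay each message received from this client to all clients.
    socket.on("chat message", (text) => {
      io.emit("chat message", text);
    });
  });
}
```

Passing `io` in as a parameter keeps the handlers separate from the server setup, which also makes them easy to exercise with a stand-in object.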

How to add these events into the client is shown below.

Client

Here is an index.html file which includes our previous scripts, a simple form with input for new messages and a container for displaying messages.

Now we will add some additional logic to our "script".

The key points here are the socket.on(event, callback) functions. When our server emits events which match the first ‘event’ argument the callback will be run. Inside these callbacks we can take the actions we want on the client-side. In this case, displaying the message on the screen.
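As a rough sketch of the socket.on(event, callback) pattern just described, the client logic might look like the following. It assumes `socket` is the object returned by `io()` from the Socket.IO client script; `showMessage` is a hypothetical callback that renders a line of text on the page, and the event names are illustrative rather than taken from the original listing.

```javascript
// Wires the client-side listeners and returns a function for sending messages.
// `socket` is assumed to come from `io()`; `showMessage` is a hypothetical
// rendering callback (for example, one that appends an <li> to a list).
function bindChatClient(socket, showMessage) {
  // Run a callback whenever the server emits a matching event.
  socket.on("user connected", (notice) => showMessage(notice));
  socket.on("chat message", (text) => showMessage(text));

  // Send the form input to the server, ignoring empty messages.
  return function sendMessage(text) {
    if (text.trim() !== "") {
      socket.emit("chat message", text);
    }
  };
}
```

In the page script, the returned function would typically be called from the form's submit handler.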

Maintaining & Operating Socket.IO

As explained above, getting started with Socket.IO is relatively simple – all you need is a Node.js server to run it on. If you want to get started with a realtime app for a limited number of users, Socket.IO is a good option. Problems come when working at scale. Say, for example, you want to build a CRM-like app that enables communications between businesses. Socket.IO is built on asynchronous networking libraries and puts load on your server. Maintaining connections to users, as well as sending and receiving messages, adds strain. If clients start sending significant amounts of data via Socket.IO, it streams the data in chunks, freeing up resources as each chunk is transmitted. So when your application attracts more users and your server reaches its maximum load, you will need to split connections over multiple servers or risk losing important information.

Unfortunately this is not as simple as adding another server. Sockets are an open connection between a server and client. The server only knows about the clients who have connected directly with it and not those connected to other servers. Going back to the chat example, imagine you want to broadcast a message to all users that someone joined the chat. Users connected to a different server wouldn’t receive this message.

To solve this problem you need to have a pub/sub store (e.g. Redis). This store will solve the aforementioned problem by notifying all the servers that they need to send the message when someone joins the chat. Unfortunately, this means an additional database to maintain which will most likely require its own server.
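To make the idea concrete, here is a small simulation using only Node's built-in EventEmitter as a stand-in for the pub/sub store. It is not Redis and not Socket.IO: the `ChatServer` class and the client objects are hypothetical, and the sketch only shows why publishing through a shared store lets every server reach its own clients.

```javascript
const { EventEmitter } = require("events");

// Each server only knows its own clients, but all servers subscribe
// to the same shared bus (the stand-in for a pub/sub store).
class ChatServer {
  constructor(bus) {
    this.bus = bus;
    this.clients = []; // clients connected directly to this server
    bus.on("broadcast", (message) => this.deliverLocally(message));
  }

  connect(client) {
    this.clients.push(client);
  }

  deliverLocally(message) {
    for (const client of this.clients) {
      client.inbox.push(message);
    }
  }

  // Publish to the shared bus instead of only to local clients.
  broadcast(message) {
    this.bus.emit("broadcast", message);
  }
}

const bus = new EventEmitter();
const serverA = new ChatServer(bus);
const serverB = new ChatServer(bus);

const alice = { inbox: [] };
const bob = { inbox: [] };
serverA.connect(alice);
serverB.connect(bob);

// Alice's server announces the join; Bob is on a different server.
serverA.broadcast("Alice joined the chat");
console.log(bob.inbox); // [ 'Alice joined the chat' ]
```

Without the shared bus, `serverA` could only reach `alice`, which is exactly the problem described above.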

Socket.IO have created an adapter socket.io-adapter which works with the pub/sub store and servers to share information. You can write your own implementation of this adapter or you can use the one they have provided for Redis, with which, luckily, Socket.IO is easy to integrate.

Other reliability enhancers for Socket.IO might include CoreOS to break down architecture into units that can be distributed across available hardware, introducing new instances as the load increases.

Another issue with scaling Socket.IO is that whilst WebSockets hold their connection open, if the connection falls back to polling then there are multiple requests during the connection lifetime. When one of these requests goes to a different server you will receive an error Error during WebSocket handshake: Unexpected response code: 400.

The two main ways to solve this are by routing clients based on their originating address, or a cookie. Socket.IO have great documentation on how to solve this for different environments.
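Routing by originating address can be illustrated with a simple hash. This is a sketch of the idea only; in practice the routing is usually configured in the load balancer (for example nginx's ip_hash directive) rather than written by hand, and the function name and server names below are hypothetical.

```javascript
const crypto = require("crypto");

// Picks a backend by hashing the client's IP address, so every request
// from the same address is routed to the same server.
function pickServer(clientIp, servers) {
  const digest = crypto.createHash("sha256").update(clientIp).digest();
  const index = digest.readUInt32BE(0) % servers.length;
  return servers[index];
}

const backends = ["app-1", "app-2", "app-3"];
const chosen = pickServer("203.0.113.7", backends);
// Repeated polling requests from this address keep hitting the same backend.
console.log(chosen === pickServer("203.0.113.7", backends)); // true
```

Note that the mapping changes if the list of servers changes, which is one reason cookie-based routing is sometimes preferred.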

While Socket.IO does tend to have good documentation for ways round its limitations, these generally count as ‘remedies’ rather than solutions. If you intend to scale further, these suggested ways round end up adding complexity and extra margin for error to your stack.

When does Socket.IO reach its limits?

As with all tech, choosing the right one means being clear on your ambitions for your product. Socket.IO does make many things easier in comparison to setting up sockets yourself, but there are limitations and drawbacks in addition to the scaling issue mentioned above.

The first is that the initial connection takes longer than with plain WebSockets. This is due to it first establishing a connection using long polling (xhr-polling), and then upgrading to WebSockets if available.

If you don’t need to support older browsers and aren’t worried about client environments which don’t support WebSockets you may not want the added overhead of Socket.IO. You can minimise this impact by specifying to only connect with WebSockets. This will change the initial connection to WebSocket, but remove any fallback.

Client

Server
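A minimal sketch of a WebSocket-only setup follows. It relies on the `transports` option, which Socket.IO accepts on both the client and the server; the variable names and `httpServer` are assumptions, and the server lines assume Socket.IO v3 or later.

```javascript
// Restrict the connection to the WebSocket transport (no long-polling fallback).
const websocketOnly = { transports: ["websocket"] };

// Client (browser), assuming the Socket.IO client script is loaded:
//   const socket = io(websocketOnly);
//
// Server (Node.js), assuming `httpServer` is an existing http.Server:
//   const { Server } = require("socket.io");
//   const io = new Server(httpServer, websocketOnly);
```

Both sides need the same restriction; otherwise a client could still attempt the initial long-polling request.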

In this scenario, the client will still need to download the 61.2 KB socket.io JavaScript file. More information on this process is here.

For streaming that’s data heavy by definition, for example video streaming, sockets are not the answer. If you want to support data exchange on this level, a better solution is WebRTC or a data-streaming-as-a-service provider, Ably being one of several.

Socket.IO – the future?

Socket.IO doesn’t appear to be actively maintained. The last commit was approximately 3 months ago with most of the codebase free of new commits for much longer. Also, there are currently 384 open issues. For those starting a new project with sockets it is concerning whether Socket.IO will continue to be supported. At the time of writing (July 2019) the situation is unclear beyond the information below. If you have further information do get in touch.

Looking at NPM downloads, Socket.IO use has been increasing but only gradually.

On the other hand, SockJS and WS have been steadily growing and have outpaced Socket.IO in NPM downloads.

This indicates that although use of sockets has increased, developers have chosen alternatives to Socket.IO. Some have chosen packages such as WS or SockJS. Others have opted for hosted solutions where the complexity of real-time messaging is handled for you, many of which operate freemium models.

As you can see below, all modern browsers now support WebSockets. This negates some of the need for a package which handles socket connections on the browser and explains the rise in popularity of packages such as WS, which handle the server-side socket connection but rely on the native browser API for client-side connections and communication.

Wrap Up

As we have explored, Socket.IO is a great tool for developers wanting to set up bi-directional socket connections between client and server. This makes simple applications such as live chat much simpler to implement. Socket.IO makes many things easier and provides fallbacks for unsupported clients, but has its own trade-offs.

Scaling applications is perhaps the most difficult step in using sockets, and Socket.IO’s implementation for non-WebSocket connections further complicates the process. Socket.IO’s future support is also questionable.

Aside from the question of future support, whether or not to use Socket.IO really depends on individual use case – for starting out building simple realtime applications, Socket.IO works well. With WebSocket support widely spread (answering to a huge growth in demand for realtime applications and services since Socket.IO was set up in 2010), there is now more choice to use similar packages closer to the native implementation, so it’s worth comparing Socket.IO to these as well. For more complex apps, or apps you think will scale, be prepared to add other tech to your stack. To help gauge what stage your app is at in terms of future scale and realtime needs, get in touch with Ably’s realtime engineers. We aim to be as impartial as possible.

Hidden scaling issues of distributed systems - System design in the real world

Michael Ade-Kunle — Tue, 04 Feb 2020 13:10:18 +0000

In the first distributed deep dive interview we explore hidden scaling issues.

Who are we talking to in this interview? Paul Nordstrom — Entrepreneur & former Systems Architect at Google, Amazon & OfferUp — and Paddy Byers — Co-Founder & CTO at Ably Realtime.

Distributed Deep Dive Series — #1 Hidden scaling issues: System designs in the real world

Paul Nordstrom is a seasoned entrepreneur and systems architect with vast experience of working with distributed systems at the likes of Google, Amazon and most recently OfferUp.

Paddy Byers is a Co-founder and CTO of Ably Realtime with over two decades of experience in software and services and deep technical expertise in building scalable distributed systems.

We have transcribed this interview with Paul and Paddy below for your convenience.

Intro

[Matthew O’Riordan] — Welcome to our first episode of Distributed Deep Dive. In this series of videos, we’re going to be talking to interesting people who have worked on distributed systems. In our first series, we’re talking about hidden scaling issues and distributed systems in the real world. And today we have Paul and Paddy, and they will be talking about the real problems they face scaling internet scale systems, and we’re hoping that what you might learn from this is how you can avoid some of the mistakes that Paddy and Paul have made while building these systems. Paul would you like to introduce yourself?

[Paul Nordstrom] — Sure, my name’s Paul Nordstrom, and I’ve been writing software since high school, I programmed my way through college and then, a brief detour into finance but I ended up building financial systems. And when the internet came along Amazon offered me a job in 1999 so I was lucky enough to come at a time when they needed somebody to do a clean sheet architecture for their website. So that was my first internet-scale system and some of the service-oriented architectural things that we invented there in fact, have permeated the industry. So that was a very successful system. They’re still running parts of it, some of it’s been replaced. After about eight years of Amazon, I took a break, I wanted a change of venue, I went to this other little internet search company called Google. And at Google, I was in charge of the design and building of a system that’s been published. So if you’re curious about some of the examples you can read the paper called MillWheel and I was at Google for about 10 years.

[Matthew O’Riordan] — Excellent. Paddy, would you like to introduce yourself?

[Paddy Byers] — Sure, so my background is in Mathematics. In my doctorate, I worked on mathematical problems underpinning formal verification. In my career of work, I’ve done a wide range of different systems, safety-critical systems, security critical, embedded, realtime, and a lot of consumer electronics. At Ably Realtime I’ve built the core realtime messaging product, and although we have a growing engineering team now, I still spend most of my time coding and building the next set of features for the product.

Unexpected problems

[Matthew O’Riordan] — Paul, so if it’s okay we’ll start with you. The first question I think we’d like to talk about is I think we have a good understanding of what the textbook problems are with scaling distributed systems but I think in the real world the problems we encounter are the problems we didn’t expect. Can you tell us a bit about your experience with this?

[Paul Nordstrom] — The first thing you do when you’re designing a system is try to figure out which dimensions it’s going to scale in. And that’s the part you’re talking about that we usually understand and although doing a good job of that involves shedding some of your ego, you should involve other people in the design discussion because there are things you’re going to miss. And you know, you obviously want to minimize the things that you didn’t think about in advance, so that’s the first step of dealing with the unintended scaling issues is to not have some of them. Once you get past that though, you have to accept the fact that you’re going to run into problems that you didn’t anticipate. And even if you’ve done a great job of minimizing it, the real world just throws more, the world of software is more complicated than a human mind can encompass. So you do everything you can, you build mathematical models for your system so that you have a coherent framework within which your system is going to operate and you can reason about it using the mathematical models. So that’s a great first step. And I think that in the industry that’s one of the things that gets short shrift, I think very few people really do a good job, and it’s one of the things I found when I started looking at Ably Realtime I was most impressed with was that it had a mathematical model, and you could understand what it was intended to solve, what problems, what scale dimensions were possible. MillWheel is the same, it has a mathematical model for distributing computations on time series, treating a time series as a foundational abstract thing that you could reason about. To get onto the meat of your question, once you’ve done everything you can, and you find that you have unintended scaling problems, well what do they look like? And maybe having heard about a couple of them here you can help anticipate that happening to your systems.
Even one unintended problem that you think of in advance and convert into a known problem will make an incredible difference in the quality of the system you build, and the time within which you get it done. Because solving these problems after the design phase is you know, sorta order 10 to 100 times as difficult right, we all experienced that. Fixing a bug takes at least 10 times longer than it does to avoid that bug in advance by making a good design. In MillWheel one of the things we didn’t anticipate was the rate at which, so MillWheel has this partition space, okay every piece of data being manipulated has a key, let’s say the data that you’re manipulating is Google queries, in fact, MillWheel is actually used to process the stream of Google queries received on the internet. And we anticipated that some queries happen more often than others, okay. We didn’t do a very good job of analyzing the data and we didn’t realize there was this one query that so far exceeds every other query that not a single machine could handle the processing of it. It happens to be the query Google. People type “google” into the Google search box with great regularity, more than any other single query. If we had talked to the team, one of the reasons we didn’t anticipate this is we didn’t know that MillWheel’s gonna be applied to this problem, we couldn’t talk to the team that did it. But we could have done a better job of talking to more of the community about how they might use such a system, because it was fairly well understood that Google needed a system for processing continuous data at the time. They had MapReduce and MapReduce is a big, enormously scalable system that solves plenty of problems but not the continuous data problem.
So all of these are like little lessons but they add up: spend more time talking to your users about how they would use your system, show your design to more people, you know, just shed the ego and shed this need for secrecy if you can, so that you get a wider spectrum of people who can tell you, I’m gonna use it like this. And then, when you run into the inevitable problem, you know, having done the work you did before, your system will have a cleaner design, you’ll have this mathematical model. You know, then at least when you get into this case of a problem that you didn’t anticipate you will have a cleaner bed from which to solve it.

[Matthew O’Riordan] — Would you like to add anything to that Paddy?

[Paddy Byers] — Yeah, I think just to echo what Paul said so, when you set out to build a system, you know roughly speaking, what the first sort of problem is you’re trying to solve. The scaling problems you run into are the second order of problems often. So this is not just how many messages can I support in the channel but the second order of problems, the rate of change of a number of channels, or rate of change with the number of messages. Sometimes you’re caught out because the engineering catches you out. Things turn out to be harder than you thought, and sometimes the user catches you out. So they use the system in ways that you didn’t expect when you designed it so, you find a whole bunch of new use cases that you then have to go and deal with or they give you scaling issues that you didn’t solve originally.

[Paul Nordstrom] — Alright so, let me add to that, sometimes you’re going to come across, you know, attempt to solve, a problem that can’t be reasonably solved. And sometimes you just have to accept that your system can’t scale in every dimension. And you might just have to say to the customer inherent in our system is a limit on the number of channels we can support, or the rate in which you can add new channels, or whatever it is in your system. In fact, tying this back to the MillWheel problem, one of the things we said is if your key space can’t be partitioned so a single key can be handled by a single machine we’re not going to solve that for you, you’re just going to have to find a different solution for that problem. And I think it’s important to realize that trying to make your system solve every problem eventually ends up with a system that is too complicated to use or one that doesn’t work very well for any of the problems you’re trying to solve.

Limits of scale

[Matthew O’Riordan] — The limitations of the system and the limits of your scaling parameters is obviously important from the start. How do you think you go about understanding those limits, do you understand them once you’ve built the system or do you think you can preempt some of that or do you think you just deal with problems as they arise in areas that you may not understand fully?

[Paul Nordstrom] — I think the first point is that you have to understand at least one scaling issue from the get go which is the scaling issue that your customers have that you’re here to solve. If you haven’t talked to people about what their needs are then you haven’t clearly identified the way in which your system is going to scale in the dimensions they need, you haven’t done your homework and you’re doomed to fail. Unless you know, without just great luck. If you have chosen a set of scaling dimensions that you’re going to attack with your design, and you study those dimensions and you make sure that your architecture, I think that people get this part though. I think that they understand that you need to solve the issues of the customer and I think they understand how to design it’s the second order ones again. And you know, we talked earlier about how do you know, at least try to head off some of the issues with the second order scaling issues but really the answer to your question is if you’ve done your homework, and you have addressed the issues that your customers need you to solve, and hopefully you’re providing something nobody else does or you are much better at than other systems out there. You know, your systems need to have a competitive advantage too. So they need to solve customer problems better than the other systems that are available to your customers, then you’re gonna have a successful system. Whether or not you end up with second-order scaling issues that you can or cannot solve you’ve met this primary need, and what I think it’s really about is, you know, clear thinking and having a clear process for identifying those and solving those.

[Matthew O’Riordan] — Paddy do you think, I mean when designing building and then running at scale, do you think there have been second order type problems that even now looking back, you think would have been hard to predict what those problems were until those problems arise, I mean are there any sort of specific examples you can think of that kind of you know, show how difficult it maybe is to predict these types of problems sometimes.

[Paddy Byers] — Yeah, I think the kind of example that comes to mind is the sort of thing I mentioned where the customer catches you out. So they’re doing something that you didn’t anticipate. So we built something imagining that channels were long-lived and we optimized to the greatest extent possible the cost of processing a single message. But that catches you out if what the customer’s really doing is they’re creating and destroying channels at a very high rate. Another example is fan-out, right, we imagined we’d be dealing with typically, one of the scaling dimensions is, I want to be able to fan out messages to a very, very large number of subscribers but then we have customers who don’t have that problem, instead they have channels with a single subscriber and what you need to do instead is minimize the cost of establishing the first subscriber not establishing a million subscribers. The other thing I would add to that is in understanding the scaling limits I think what you have to look out for is what’s an order-N feature of your system, and what is or can be better than an order-N feature of your system. If everything, in the end, will become a limit to scaling, the question is do I have to deal with it now or do I know I can deal with it upfront, so if it’s an order-N thing I know I’m gonna have to deal with it at some point, but do I know a way that I could improve that when the time came or is it something I have to address now.

[Paul Nordstrom] — I think that’s one of my fundamental precepts of the building of a system. Even if you have a pretty clear idea of the design you can’t build every dimension of your system to its ultimate capability on the first pass. You have to pick ones that you can understand well enough to build right, you should build those right, but the ones that you know you don’t understand well enough to build right you should build a hack that has a shell around it that you can then you know, use as let’s say, an interface, or an abstraction layer, but not actually try to build it when you don’t know enough to do it. And one of those things is, when it’s a customer, this is tying off of what you just said, when what you don’t really know is how people are gonna use it and then it’s not that you couldn’t solve one of the problems and you don’t understand but it’s you don’t know which is really your problem yet. Then you can make a much better decision later on.

Microservices

[Matthew O’Riordan] — How does what we’ve been talking about relate to microservices?

[Paul Nordstrom] — Well, you’re pushing one of my buttons here because personally, I don’t think there’s such a thing as microservices, I consider the whole thing to be a giant buzzword that people have used so they get internet audiences and so on. I think services span a range from very, very small, you’re free to call those microservices if you’d like, but there is no cut off where they start becoming not microservices they just get bigger and bigger until they’re gigantic services that have large scale functionality. And your choice of the use of these depends on your problem and you should choose the right one for the job. And you should make a balanced choice between functionality and complexity, power, size, etc., and ease of use. If it were up to me I’d wipe the term off the internet discussion list and get rid of it. But the important things about the microservices movement, in my opinion, that is valuable is that the tendency should be to push yourself down a scale when possible as appropriate the smaller service is probably the better one. It’s easier to reason about okay, it’s easier to swap something else in if you need different characteristics than the choice you made at the beginning of the problem. So I don’t believe in microservice, in fact, I probably will never say the word again after this interview. But I do believe that learning to balance the use of out of the box components and to choose wisely should be geared towards the smaller of the two choices if all other things are being equal. I think that’s important, okay. But that said, often microservices are in contradiction of the need of a system to scale in dimension because one of the ways you get scaling out of a particular functionality is to couple two parts of it, the tightly coupled one usually outperforms the loosely coupled one.
And an architecture composed purely of microservices in the long run, if it’s attacking a truly difficult scaling problem, I don’t think you’re going to, in general, be able to solve it just by composing microservices, no matter how well you do. I think you’re going to be forced to choose in certain cases, to couple search semantics with the storage semantics, for instance, and you end up with something like relational databases instead of something like Spanner. It depends on the situation, I’m not calling out Spanner in a negative way, Spanner was a brilliant piece of work, and it solves a huge swath of the problem space you know, in the dimension that it’s approaching in storage that had never been attacked before.

[Matthew O’Riordan] — So Paddy, at Ably we’ve, I know we’ve had this discussion about the dreaded word microservices, but my understanding was that the problem with splitting a lot of the services that we run up was that actually, we’re creating not just potential performance bottlenecks, which are probably less of a concern, but more operational bottlenecks. And that trying to sort of manage versioning of APIs across lots of different services, just creates a lot of complexity in the system unnecessarily. What are your thoughts on what we’re doing within the RPC layers and how components talk to each other but still sort of keeping them as single larger components?

[Paddy Byers] — A lot of what we have done, the systems grew up relatively monolithic, and we face the question of when is it a good idea to split something into a separate service? And as you say, there are operational complexities and performance complexities that come about when you do that. However, every single thing you do eventually becomes a problem no matter how insignificant or negligible performance wise it seems to be initially, eventually, it will become a problem. So the real question for us is at what point does something become significant in a way that means you want to scale it independently? So do you want to be independently elastic with a particular service or is there a particular advantage, a person needs to be able to deploy it and give it a different deployment lifecycle from other services. And that’s the point at which you would then consider splitting it out into a microservice, and that’s broadly the approach you’d take.

[Paul Nordstrom] — I wanted to add to that is that I think that you should know and be conscious of the core functionality, we talked about earlier, what problem is your system solving for its customers and in what dimension in scaling, in particular, it’s solving a problem for your customers. And, you know when something’s not central it’s much easier to split it out and make it something independent and if it is central it’s much less likely to become a service by itself. Because like I said, the loosely coupled hardly ever outperforms the tightly coupled. You can’t afford to have everything tightly coupled so and of course, everything isn’t central. So it’s not that necessarily hard a thing to do but I think it’s a conscious decision. You said alright, this is central to my system, I’m willing to be coupled here.

CAP theorem

[Matthew O’Riordan] — So I think it’s interesting that the conversation has largely been talking about balances and trade-offs in systems and understanding where your limiting factors are. It reminds me a bit of CAP theorem and the idea that you can only honour two of the three principles. What are your thoughts on that in regards to distributed systems and how that applies to what we’ve been talking about?

[Paul Nordstrom] — The CAP theorem tells you that you will have to make trade-offs, right? To my knowledge, there are no examples of a system that you know, violates that, that manages to make no trade-offs between the CAP principle properties. My favorite system design, in fact, it’s come up before, is Spanner. And Spanner makes great promises in all three dimensions but it doesn’t violate CAP it just hits the sweet spot. And the reason I’m bringing it up is because I think it’s a great example of something people should consider when they’re designing and building a system is to recognize and acknowledge they’re going to have to make trade-offs and then find the sweet spot for whatever it is they’re designing that solves a wide swath of users’ needs and that does a great job of capturing the users’ demands in each of the dimensions they care about. You should be aware you’re going to have to choose at most two of those you know, the CAP properties that you’re going to truly solve. But I think a system that does a really good job in each of them and not perfection, I mean, we were talking about outages you know, earlier today and outages happen regardless of what your system was designed for, you may have perfect guarantees around eventual integrity. But that doesn’t mean that your users are going to see perfection because they’re not, their personal connection might be down to their internet provider. And what they really care about is their business. I think that doing a good job of satisfying scaling needs in every dimension that you do identify, as we sort of talked about at the beginning of this talk, is more important to them than meeting a sort of theoretical CAP properties test.

[Matthew O’Riordan] — Okay, I think that’s, I mean, you know, in the context of a messaging system, Paddy, I can see how those lines would get blurred and deciding what that sweet spot is because you know, the latency of a message to be delivered is something that will be variable and I expect certain use cases require latency guarantees. So I mean, what are your thoughts Paddy, on how I suppose, that thinking is applied at Ably Realtime?

[Paddy Byers] — How does it relate, so yeah so first of all, sort of naively, the CAP theorem would have you believe that these are binary properties. That means the system is either available or not available. And in reality, that’s never true. You get outages in some part of the world, but in a global system, you can’t deny service just on the basis of a single regional fault. So availability and the other properties are not binary attributes. But the other thing that CAP specifically applies to is the understanding of integrity if you want to build an ACID database or you want linearizability as a property of your system. In our case though we get to choose the semantics. So we’re able to say, in the specific case of messaging, in the case of a conflict between consistency and availability that we get to choose exactly what semantics we decide are useful for our customers and what things we guarantee to uphold in the presence of certain kinds of failure and what things we don’t guarantee to uphold. So we have a lot more flexibility in designing our system than you would have if you were designing a relational database, and you were trying to make the CAP tradeoffs. So that’s what we’ve done, through a combination of what our customers tell us their requirement is and our own understanding of what’s achievable in practice.

Fault tolerance

[Matthew O’Riordan] — What are your thoughts on fault tolerance, Paddy?

[Paddy Byers] — Fault tolerance, so I think you have to set out with the idea that everything will fail at some point. So you have to do things as a matter of routine so as to recover from certain failures. And then the decisions you face are what kinds of failures am I gonna handle routinely with no degradation of service and what kinds of failures am I prepared to allow for some degradation of service? And in doing that then you have to look at well how would I survive these failures, what level of redundancy am I gonna maintain, what is the cost of maintaining that, not just the resource cost but also the performance cost, the operational cost of managing that redundancy? Also if there’s more than one failure do I really think those failures are independent? So if I think that my failures are not going to be independent, so when one thing fails then I’ll have a greater likelihood of the thing I’m failing over to also failing, then no amount of redundancy is gonna give 100% fault tolerance. So it’s a trade-off between understanding customer requirements, understanding the business operational cost of achieving it and understanding the real world engineering practicality of actually making it possible.

[Paul Nordstrom] — I agree with Paddy’s comments about that, and another dimension of it to be really conscious of is to consider how to balance between maximizing the mean time between failures and minimizing the mean time to recover. In any system that I’ve worked on, and ones I know about, often times this isn’t done thoughtfully and consciously upfront, but sometimes minimizing the time to recover works out much better than maximizing the time between failures. An engineer’s natural tendency is to try and maximize the time between failures, right, to make the system as reliable as they can. That’s not always the right answer and sometimes just making it recover really quickly from failures gets you where you need to go, in terms of value to your customer, and simplicity of the engineering, and it solves problems that you might not have considered. You know, I mean by nature, trying to maximize the mean time between failures means figuring out what’s gonna fail and preventing it from failing, right. Minimizing time to recover solves problems you never considered before, and we were talking about second order and unanticipated scaling problems, well this is sort of applying that same concept to the failure scenarios where you try to minimize unanticipated failure problems too. So this is something I, you know, hadn’t happened to consider, didn’t learn until I went to Google and at Google, they’re super sophisticated and genius there obviously. This is one of the things that they taught me, is to stop at the beginning and say, okay, in every kind of area of failure, what are your areas of failures, that goes to what you just said, but then secondly am I going to attack this area of failure by making the system recover very quickly from it or by avoiding the problem through extensive tests and great design, and so on. And I think there’s a really important lesson here.
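
The trade-off between time between failures and time to recover can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration; the figures (1,000 hours between failures, 1 hour to recover) are hypothetical and were not mentioned in the conversation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1,000 hours, 1 hour to recover.
baseline = availability(1000, 1)

# Option A: double the time between failures (prevent half of the failures).
fewer_failures = availability(2000, 1)

# Option B: leave the failure rate alone and halve the recovery time.
faster_recovery = availability(1000, 0.5)

print(f"baseline:        {baseline:.6f}")
print(f"fewer failures:  {fewer_failures:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
```

Under this formula, halving the recovery time buys the same availability as doubling the time between failures, and fast recovery also helps with failures nobody anticipated.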

[Matthew O’Riordan] — Do you think that applies also to dealing with an unexpected load? I mean there’s an element of sort of, hitting a high water mark, and that I think for me, my understanding is that the high water mark kind of sits almost in both camps, that you’re saying I’m willing to reject some work so that I can save the system and recover, but equally I’m degrading the service at the same time or degrading the service at least. I mean, how do you think that applies to what you’ve been talking about, I mean is there anything specific you think our listeners would want to know about?

[Paddy Byers] — Well I think yeah, reflecting on what Paul said, so in terms of coping with load, so you’ve designed things to be elastic, they can scale when load scales but then you have the second order problem which is how quickly can you do that. And you know, you can do two things, one is you can operate with a margin of capacity to be able to handle a certain spike in load without having to scale and then you can react to that instantly or you can build things to be able to scale very quickly. So building things to be, not just inherently elastic, but inherently scalable, being able to react to spikes. You can reduce the complexity of the system potentially if your way of reacting to spikes is to be able to scale very quickly.

Team structure

[Matthew O’Riordan] — One area I think we haven’t yet touched on is the people element of how you build and maintain a distributed system and arguably this is the most important element. Paul, I think you’ve obviously got a huge amount of experience and insight into this area. Can you share your experience, wisdom on how to structure teams?

[Paul Nordstrom] — Well I’m not sure structuring them is the right question, I think that you know, to build a scalable system you need the whole team focused on that. There’s no substitute for just experience building these systems, you have to engage the team and you have to make sure that people, that there’s at least somebody on the team who has this experience or you’re not going to end up with a system that’s scalable. But beyond that, I think that there are all sorts of, I don’t want to call them tricks, but they feel like tricks and they’re effective at ending up with a design. Part of that is to make sure that the entire team, I’m just afraid here that I’m gonna end up saying, you know, what’s the expression? Mom and apple pie, you know? I’m afraid that this part of the discussion is gonna end up with a mom and apple pie kind of answer because I don’t think there are any silver bullets here. I think that there is, it’s important to involve the team in the design so they’re all bought into it, I’ve made that mistake before a couple of times, you know, where I had the idea in my head, the design in my head and thought you know, that it was so clear to me, right? Bad answer. But you know, there’s other things you can do like if there’s a design question that is difficult and not clear one of the things that I found really effective is to, you know, pick an advocate for each of, let’s say, the two most likely solutions to the problem, and then have one person argue one side of that and another person argue the other side of it and then make them swap. And argue the opposite side of what they believe, and this has worked out great for me in terms of getting to a consensus because it makes people look at the problem from the other person’s perspective. And people don’t want to argue the side of the question that they don’t agree with or believe in, but it’s really effective.
The other thing about getting a team to build a you know, a system that scales both in the you know, error space which we were talking about a minute ago, and in the sort of production throughput space which is where we started the discussion, is to have a conscious list of things you’re going to address and get those on the table early on. A lot of things we discussed today you know, if you gather them into a little checklist, you can then make sure that you didn’t forget to do those. There are so many aspects of system design, right? And a lot of them are judgment calls, they’re not answers, they’re just you know, but if you don’t consider those questions you’re not gonna make good judgment calls. And so, having together a really coherent and inclusive list of the aspects of system design, really this is a kind of a summary of what we’ve said today, is that having that checklist is what leads to, and then using the team concept to explore the checklist, you know, assigning people to aspects of that checklist, to consider and to you know, come up with a recommendation, and a plan and so on. You know, I think this sort of ties together the whole conversation we’ve had today which is that there aren’t silver bullets but there are points that need to be addressed you know, consciously and thoughtfully and that if you do so you end up with both the good teamwork side of this whole problem addressed and with good answers to the technical questions that need to be addressed, so.

[Matthew O’Riordan] — That’s interesting, I think it’s also quite important to consider one more thing, which is the complexity of the system. And I’m interested to know how often you’ve had to reduce the complexity of a system, knowing it will potentially affect its ability to scale, because the complexity is too great for the team to maintain. And so as an example, Rob Pike famously talked about how Go was designed not necessarily to be a language with the most features but a language that allowed the team to work coherently together and understand each other’s code. Has that ever happened in the designing and building of distributed systems?

[Paul Nordstrom] — It’s a great question, partly because that’s what I would consider being my largest failing as an engineer, the place where I get into the most trouble, is not simplifying when I should have. So as a lesson to my future system designs, and as a lesson to other people maybe who are a little like me in this way I think it’s a brilliant thing to point out, that that’s a choice that you should lean towards, you know well especially if you’re inclined like me to design things that are maybe a little too complicated. Intentionally restricting both the capabilities but at the same time the complexity of your system, I think is a brilliant way to end up with a system that works well, and you’re just like going a little bit too far and then coming back to your comfort zone, you know you just end up with a system that works better and is inherently more understandable, and those are both important points.

[Paddy Byers] — Definitely, less is more a lot of the time. I think that you have to remember that fundamentally, building software is a human activity and it’s you know, it’s performed by humans and maintainability is just as important a property of the software you write as reliability, and security, and performance, and all those other things. And ultimately, you do have to make sure that you build things you know, within the cognitive limits of the team that you have.

[Paul Nordstrom] — In fact as a leader of one of these teams building something you know, something that I wish I had done before would be to stop and say okay, here’s the team, my team has to build this piece of software, right. In practical terms I want something to come out of this effort that’s beautiful and works really well. I have to take into account just the general capabilities, the expertise level, the experiences that they’ve had in the past, and even how well they work together, and use that to influence what you just said about choosing what level of complexity I’m going to design the system to.

Tips

[Matthew O’Riordan] — I think it’d be really nice to extend a set of recommendations to our listeners, to just say, after watching this video, what actionable things can you do to take this learning and apply it to distributed systems that you may be trying to build. Can you share your tips Paul?

[Paul Nordstrom] — I’ve mentioned earlier about having a checklist. What I don’t believe in is that there is one checklist that is right for everybody, that I should put down my checklist and put it on the website and then you guys can use it. I think that each of the engineers watching this video is facing a set of problems with a different team, in a different environment, and has his or her own personal experience level and needs. I’m hoping that you know, maybe on review or maybe you’ve been doing it as you went along, you’ve heard some things that resonated with you and so my final point is only that you should consciously make that effective by turning it into a list that you can use to remind you not to omit things when you’re doing your next system design. Because the design of a system in my mind, it’s the most complex thing undertaken by mankind, is the design of a large software system. You know how many man-years of effort go into it, and it’s not just man-years of work, it’s man-years of thought okay. It’s unbelievable. Alright, so there’s no way anybody I know can keep this whole thing in their head without extra aids. I advise that you create external aids customized to yourself, and to your problem, and your environment. And then use those, apply them to the systems you’re designing in the future.

[Paddy Byers] — What I’ll add to that is, go back to what we said at the very beginning, which is a computer science textbook will list out the problems that you have to face when you’re building a distributed system, you have to cope with unreliable networks, you have to cope with latency, you have to cope with consistency issues, but really what we’ve been talking about here is what you learn when you go beyond that, when you design things to operate at scale and when you have experience operating them. So what I would encourage people to do is get away from the textbook and just go build stuff and that’s the best way to find out how to do this for real.

Message queues: the right way to process and transform realtime messages

Michael Ade-Kunle — Tue, 07 Jan 2020 11:34:16 +0000

You can also read our recent post on SQS FIFO Queues, which looks at why converting a distributed system to FIFO and exactly-once message processing requires considerable user effort and what to bear in mind if planning to implement this.

In this article I explore the juxtaposition of a pub/sub fan-out messaging pattern and a queue based pattern, and why both are sometimes needed. Finally, I look at the solution to this problem.

A pub/sub pattern is designed so that each message published is received by any number of subscribers. This pattern is used by most realtime messaging providers, including Ably. Queue based patterns on the other hand typically require that each message is received only once by a single subscriber in a linear yet distributed fashion, often to be processed by workers.

A typical scenario where the pub/sub pattern works well

Take a company like Urbantz that rely on Ably to broadcast the position of vehicles as they traverse our roads. If you set out to build a similar GPS delivery tracking system, the flow of data between the vehicles and consumers wishing to track parcels may look something like this:

The pub/sub pattern and Ably's platform are a good fit because:

The vehicle publishing its location is decoupled from anyone subscribing to messages. Once the publishing client receives an acknowledgement (ACK), it can trust that the data has been broadcast successfully.
Any number of devices can subscribe to updates on the channel dedicated to the vehicle, and those devices will see the position of the vehicle in real time.

When pub/sub feels like forcing a square peg into a round hole

Expanding on the example above, if you were to build a complete vehicle tracking system, you may have additional requirements to:

Persist roll up data for the vehicle’s GPS locations into your backend database. For example, you may want to store the most recent lat/long every 15 seconds.
Trigger actions as part of your workflow when a vehicle reaches its destination or when it’s delayed.

I’ve seen other realtime platforms mostly recommend approaching this problem in one of three ways:

The first: all data that would have been broadcast in real time is instead sent as an HTTP request to your own servers. This isn't ideal because:

Any latency in your own servers will affect your clients
If your servers are unable to cope with a sudden burst of realtime data then the lat/long data is lost
You lose the benefits of a global resilient realtime platform that routes data efficiently i.e. data in EU is never unnecessarily routed through the US

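The second: each update is published twice from the client, once to Ably and once to your own servers. This solves the problem of latency and resilience by using Ably directly from the publishing client, but it does introduce a new problem: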

Operations can no longer be atomic. What does the client do if the publish to the backend server fails, yet the broadcast to Ably succeeds? A single failure can result in your client devices and servers having different representations of the state with no straightforward way to rectify the problem.
Each publishing client has to do double the work and consume at least twice the bandwidth for each broadcast. On mobile devices, this matters.

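The third: your own servers subscribe to the realtime channels and process the messages as they arrive. Our customers often find this approach seems like the most obvious answer to the problem, but it has many flaws and technical challenges: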

If you have a sudden sustained burst of realtime messages published across all your channels, your servers could easily fall behind. We typically retain connection state for two minutes, so if you fall behind by more than two minutes you’ve got problems and can expect data loss.
How do you distribute the work amongst your workers? Assuming you had 5,000 channels with one message per second each, and based on your testing you know you can process 500 messages per second per server, then you will need to work out how you share the work out amongst your workers. The pub/sub pattern is a bad fit here as if you had 10 workers subscribed to all 5,000 channels each, they would all be processing all messages on all channels i.e. 5k messages per second each. The solution to this we most often see is to use a hashing algorithm to work out which workers subscribe to which channels. But this approach adds a lot of complexity especially when channels are dynamic and are added and removed on-demand.
Your workers now need to maintain state. They need to know which channels are active at any point and need to ensure they can retain this state through redeploys and crashes. This is hard, especially when you have channels frequently opening and closing. WebHooks can alert you to channels opening and closing, but what happens if your system fails to process one of these requests correctly? The answer may be a periodic re-sync step, but therein lies yet more complexity.
If one of your workers is offline for more than two minutes then you will likely lose data. You can use our history feature (aka persistence) to retrieve missed messages. But that again adds complexity, unnecessary storage of data for these edge cases, and bottlenecks in how quickly you can catch up given history requires a REST request per channel per batch.
You now need stateful servers instead of stateless servers. I'm personally an advocate of stateless servers where possible as unnecessary complexity can often be avoided.
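
The hashing approach mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the channel names, the use of SHA-256, and the `worker_for` helper are hypothetical and not part of Ably's API.

```python
import hashlib
import math


def worker_for(channel: str, worker_count: int) -> int:
    """Map a channel name to a worker index using a stable hash (modulo sharding)."""
    digest = hashlib.sha256(channel.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


# Figures from the text: 5,000 channels at 1 message/second, 500 messages/second per server.
channels = [f"vehicle:{i}" for i in range(5000)]  # hypothetical channel names
workers_needed = math.ceil(len(channels) * 1 / 500)

# Each worker subscribes only to the channels that hash to its index.
load = [0] * workers_needed
for channel in channels:
    load[worker_for(channel, workers_needed)] += 1

# Adding one more worker changes the owner of most channels,
# which is part of the complexity described above.
moved = sum(
    worker_for(channel, workers_needed) != worker_for(channel, workers_needed + 1)
    for channel in channels
)
print(workers_needed, load, moved)
```

Even in this toy form, every worker has to agree on the current worker count and on the set of active channels, which is exactly the shared state the list above warns about.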

What's the solution to all this?

Message queues: the right way to process realtime data on your servers

Before we dive into why message queues are the answer to this common problem, I want to quickly explain what queues are and how they differ significantly from Ably's pub/sub channels.

![A simplified illustration of a FIFO message queue](https://ik.imagekit.io/ably/ghost/medium/max/1600/1*pvpSVgbXKFHfRWyKLwbSxQ.png?tr=w-1520)

Queues provide a buffer to cope with sudden spikes
Data messages added to a queue are stored and held to be processed later. As a result, adding messages to the queue is completely decoupled from subscribers wishing to take messages off the queue. If subscribers cannot keep up, the queue simply grows and the workers are given some breathing room and as much time as they need to catch up.
Queues fan out work by releasing each message only once
Unlike Ably’s channels, which deliver messages to any number of subscribers, queues will only deliver each message in the queue to one subscriber. As such, if you have 10 workers processing 5k messages per second, each will receive roughly 500 of those messages per second. Therefore, each worker can process the data it receives without having to worry about whether other workers have received and processed the same data. This allows the workers to be designed in a more straightforward, stateless way.
First in first out
Queues by default operate using FIFO, which means first in, first out. This approach ensures that if a backlog of messages builds up, the oldest messages are processed first. Our engineering team recently wrote about some considerations when implementing Amazon SQS FIFO Queues in a distributed system.
Queues are real time in nature
If your subscribers pick messages off the queue at the rate they are added, then the additional latency added should be in the low milliseconds. In practical terms, a queue adds no noticeable latency.
Data integrity
If a worker picks a message off the queue, but does not send an acknowledgement of the message being successfully processed, then after a short period the message will become available on the queue again to be processed by the next available worker. This feature ensures that messages are never lost.
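
The single-delivery and redelivery behaviour described above can be sketched with a small in-memory queue. This is a toy model, not Ably's implementation: the `WorkQueue` class and its method names are hypothetical, and `requeue_unacked` stands in for the acknowledgement timeout a real broker would apply.

```python
from collections import deque


class WorkQueue:
    """Toy FIFO queue: each message goes to one consumer and is redelivered unless acked."""

    def __init__(self):
        self._ready = deque()    # messages waiting for a worker, oldest first
        self._in_flight = {}     # delivery tag -> message handed out but not yet acked
        self._next_tag = 0

    def publish(self, message):
        self._ready.append(message)

    def get(self):
        """Hand the oldest message to exactly one caller; returns (tag, message) or None."""
        if not self._ready:
            return None
        self._next_tag += 1
        message = self._ready.popleft()
        self._in_flight[self._next_tag] = message
        return self._next_tag, message

    def ack(self, tag):
        """The worker finished processing, so the message can be forgotten."""
        del self._in_flight[tag]

    def requeue_unacked(self):
        """Stand-in for an ack timeout: put unacknowledged messages back, oldest first."""
        for tag in sorted(self._in_flight, reverse=True):
            self._ready.appendleft(self._in_flight.pop(tag))


q = WorkQueue()
for i in range(3):
    q.publish(f"position-{i}")

tag_a, msg_a = q.get()     # worker A takes the oldest message
tag_b, msg_b = q.get()     # worker B gets the next one, never the same message as A
q.ack(tag_b)               # B processed its message successfully
q.requeue_unacked()        # A crashed without acking, so its message becomes available again
redelivered = q.get()[1]   # the next available worker receives A's message
```

Because an unacknowledged message is handed out again, a worker may occasionally see the same message twice, so processing should be safe to repeat.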

How Ably provides queueing for our customers

If we now reconsider the problem of how to build a vehicle tracking service and process the data using your own servers, we can recommend the following approach using an Ably message queue:

The benefits of this approach are:

Rules are applied to messages published on channels that copy messages from the realtime channel to a queue asynchronously. This ensures that messages are published with low latency to subscribed realtime clients on the channel, yet also added onto a queue immediately.
If you are unable to process the queue quickly enough, we provide a reliable buffer ensuring we hold onto messages until you are ready to process them.
Any failure to publish a message on a channel will not result in the message being added to the queue. The operation is atomic.
Your workers can be scaled up and down as you require without having to worry about sharding the work between them. Our queue service automatically ensures that each message is delivered to only one worker.
This encourages our customers to have a more stateless design in their systems and thus significantly reduce the complexity.

Some queue specifics

The following data types are supported by message queues:

Messages — get notified when messages are published on a channel
Channel lifecycle events — get notified when a channel is created (following the first client attaching to this channel) or discarded (when there are no more clients attached to the channel)
Presence events — get notified when clients enter, update their data, or leave channels

If your expected volumes are low, we support WebHooks. WebHooks provide a means to push messages, lifecycle events and presence events to your servers over HTTP reliably.

If you are interested in using message queues, or have any questions, check out the docs or get in touch with us.

The realtime API family

Michael Ade-Kunle — Tue, 10 Dec 2019 10:09:06 +0000

Looking to 2020 and beyond, the proportion of data produced and consumed in realtime is growing exponentially. IDC predict that by 2025 1/3 of all data produced globally will be realtime. The engineers and organizations that make up the realtime ecosystem, Ably included, have yet to agree on how we describe the APIs we’re creating and consuming that are powering this growth. The problem is that there are various ways to describe APIs that provide realtime functionality.

API evangelist Kin Lane, and many others, have been writing and talking about these new types of APIs for a while. Event-driven seems to be the most common descriptor. Gartner has adopted the term, stating that by 2020 50% of all managed APIs will be event-driven. But still today there is no consistency or consensus between what terms like realtime API, event-driven API, or streaming API refer to. Often they’re used interchangeably.

As an engineering team and API provider working on a global pub/sub messaging platform, we work with these ideas everyday. Talking to our users we see the language they use to describe both what we do and what they achieve using our APIs. And we’ve watched discussions around event-driven architecture, webhooks, and streaming data proliferate.

Over the years we’ve thought extensively about the best terminology to use and arrived at what we call the Realtime API Family.

The Realtime API Family

Defining realtime
Streaming APIs
Pub/Sub APIs
Push APIs
Event-Driven APIs

Defining realtime
As a concept, realtime is easy enough to understand: a function or interaction perceived as immediate. Ably defines it as anything that happens in under 100ms. Any API designed such that data flowing from producers to consumers happens in the shortest amount of time possible can therefore be described as realtime.

Ably has settled on realtime API as an umbrella for event-driven, streaming, pub/sub, push, and other APIs. We’ve seen many more than four ways of describing this, but these are the most popular. It’s why we call it the Realtime API Family.

Streaming APIs
A stream is a means of transporting data. Streaming is a consumer pattern that describes how consumers receive events through a stream.

A streaming API will commonly address issues of data integrity with:

Message ordering - ensure messages are delivered in the order they were published
Stream continuity / resume - upon disconnection, resume from the disconnected point within a set period of time
Contiguous serial numbers - a simple series of ACK or NACK responses, each addressing a contiguous sequence of messages

Kafka offers a streaming API following this model for internal systems.
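
As a rough sketch of how ordering, contiguous serial numbers, and resume fit together, consider the toy stream below. The `Stream` class and its methods are hypothetical and do not reflect Kafka's or Ably's APIs; retention limits (the "set period of time") are left out for brevity.

```python
class Stream:
    """Toy ordered stream: every message gets the next contiguous serial number."""

    def __init__(self):
        self._log = []  # index i holds the message with serial i + 1

    def publish(self, payload):
        self._log.append(payload)
        return len(self._log)  # serial assigned to this message

    def read_after(self, last_serial):
        """Return (serial, payload) pairs published after last_serial, in publish order."""
        return list(enumerate(self._log[last_serial:], start=last_serial + 1))


stream = Stream()
stream.publish("a")
stream.publish("b")

# The consumer reads everything so far and remembers the last serial it saw.
seen = stream.read_after(0)
last_serial = seen[-1][0]

# While the consumer is disconnected, more messages are published.
stream.publish("c")
stream.publish("d")

# On reconnect it resumes from the remembered serial and misses nothing.
missed = stream.read_after(last_serial)
```

A gap in the serials a consumer receives is an immediate signal that a message was lost, which is why contiguous numbering is useful for integrity checks.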

Pub/Sub APIs
Pub/Sub (Publisher/Subscriber) is a well-known messaging design pattern. When it debuted in 1987 it solved the problem of loosely coupling servers. The earliest distributed systems sent events between servers using Pub/Sub, but all within reliable networks. This pattern operates by publishing messages on a topic (or channel) within a message bus, and subscribers can listen for events based on those topics.

Pub/Sub is over 30 years old now. The design makes no attempt to define semantics around ordering or continuity (loss of connection). While it’s still a pattern used widely today, these are issues that must be considered and addressed in our world of unreliable connections.

Ably provides a pub/sub API following this model but with guarantees around ordering, continuity, idempotency, and more.
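
A minimal in-process sketch of the pattern is shown below. The `Broker` class is hypothetical and deliberately ignores ordering, continuity, and network failures, which are the concerns a production pub/sub service has to address.

```python
from collections import defaultdict


class Broker:
    """Toy message bus: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every current subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
phone, laptop = [], []
broker.subscribe("scores", phone.append)
broker.subscribe("scores", laptop.append)

delivered = broker.publish("scores", "1-0")   # fans out to both subscribers
ignored = broker.publish("weather", "rain")   # nobody is listening on this topic
```

The publisher never learns who the subscribers are, and a message published to a topic with no subscribers is simply dropped in this sketch.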

Push APIs

Push is another umbrella term for every API that is real time. But it's just a producer messaging pattern. It means that data is pushed upstream over a connection, versus a pull mechanism used by the request/response pattern.

An example is helpful. A push notifications API is an example of a push API. Or a sporting event might send a single score update that traverses multiple systems via push APIs, resulting in millions of messages to global fans within a few hundred milliseconds.

Push can also refer to triggering a new request as with a Webhook. Or it can mean ‘push subscription’ - i.e. a producer needs to reach out to a consumer. WebSub is an example of this. But that’s a whole other blog post.

Event-driven APIs
Event-driven is an architectural design pattern that defines how a system processes data. It simply states a system should be responding/reacting to events as they happen. Streaming, Pub/Sub, and Push are all messaging patterns that can be delivered through an event-driven architecture. As such they can all fall under the umbrella of event-driven APIs.

Most of us know what an event-driven API is but an example never hurts. Unlike traditional request/response APIs where data is requested, event-driven APIs push data from a producer to a consumer. They can be quite simple or very complex.

In the above example a retailer needs an event-driven architecture to respond to things as they happen. An event in an event-driven world triggers a chain of events which must be processed downstream, extending through the entire data supply chain. All components in this supply chain are reactive, rapidly responding to events and performing onwards processing.

The time-bound, reactive nature of this data supply chain causes increased engineering and infrastructure complexity. And when compared to REST APIs following the request/response model, the complexity is inverted and put onto API producers rather than API consumers.

However, that’s a conversation beyond the scope of this article. But it’s something we’ve spoken about before at API World 2019: Taming the rising complexity of event-driven APIs.

And so we arrive at Realtime APIs
Streaming, Pub/Sub, and Push are all patterns that can be delivered through event-driven architecture. But the outcomes of streaming, push, pub/sub, and event-driven are all realtime functionality. They’re all means of getting data from producer to subscriber in the shortest possible time. So we can call them realtime APIs.

But you can’t have a realtime API unless it’s delivered as an event-driven, pub/sub, streaming, or push API. Hence how we’ve settled on the umbrella term of realtime API to encompass them all.

Navigating complexity and fragmentation
As you can see, it’s easy to use these terms interchangeably despite them all being different. The realtime ecosystem is still maturing and changes all the time. Eventually we believe developers and organizations will naturally come to a shared and standardized terminology for APIs designed to deliver real time functionality.

Until then, the Realtime API Family helps us to reduce complexity when we explain our cloud infrastructure and APIs to the developer community, our users, and our potential customers.

About Ably
Ably provides cloud infrastructure and APIs to help developers simplify complex realtime engineering. Organizations build with Ably because we make it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs. We support multiple open protocols and a growing number of third-party integrations. Take a look at what our customers say about us and our realtime APIs :).

About the author
Kieran Kilbride-Singh: Writer + marketer with enough technical know-how to be dangerous in GitHub repos. He's been writing about tech for five years, first flexing his fingers on topics like interoperability in IoT devices.

What is a Distributed Systems Engineer?

Michael Ade-Kunle — Thu, 28 Nov 2019 16:36:20 +0000

In the last few months at Ably we’ve spoken with hundreds of candidates for our Lead Distributed Systems Engineer and Distributed Systems Engineering roles. We’ve been surprised by how varied each candidate’s knowledge has been. It got us wondering if the challenge in finding the right people is that there is no clear definition of what skills are required to excel in this role.

Given that we've been working on our distributed realtime messaging platform, global cloud network, and realtime APIs for over six years, we think we're qualified enough to take a stab at defining what a distributed systems engineer needs.

If you want to become a distributed systems engineer, believe you are one, or want to recruit one for your team then here’s Ably's opinionated guide on the concepts you should have a thorough understanding of.

The concepts a Distributed Systems Engineer needs to know

Microservices or SoA is not a distributed system

Here's an example of a simplistic design of a service based architecture with horizontal scalability:

There’s not much “distributed” about this system. There are multiple hosts and network interconnections but they are tightly coupled. And their network interactions are reliable, have low-latency, and are predictable. Genuinely distributed, in our view, means:

Systems where nodes are distributed globally
Network interactions are unpredictable and can create partitions
Nonetheless those nodes work together to create a predictable outcome

Distributed systems, at scale, involve state being distributed and re-balanced across the system, reacting as nodes are added and removed, and they do this in spite of the unpredictability that is inherent in a global system.

Understanding hash rings is a pre-requisite

If you think a hash ring has something to do with a criminal cannabis organization, then that’s certainly amusing, but unfortunately means you’re missing knowledge of a common pattern used for distributed systems.

If the above doesn’t look familiar, then I recommend you start by diving into how popular distributed systems work, all of which rely on the ideas behind a consistent hash ring.
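As a rough illustration of the pattern, here is a minimal sketch of a consistent hash ring with virtual nodes. The class name, the method names, the choice of SHA-256, and the default of 64 virtual nodes per node are all assumptions made for this example; it is not how Ably or any particular system implements its ring.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (an integer derived from SHA-256)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Illustrative consistent hash ring; every name here is hypothetical."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed on the ring many times to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node that owns the first ring position at or after the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._points, _hash(key))
        return self._owner[self._points[idx % len(self._points)]]
```

The property worth noticing is that removing a node only relocates the keys that node owned; every other key keeps its owner, which is what makes rebalancing tolerable as nodes come and go.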

Gossip protocols and consensus algorithms underpin everything

Large distributed systems usually have to track changes in cluster topology in response to network partitions, failures, and scaling events. Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because:

Nodes come and go in elastic systems
Failures need to be detected quickly
Load and state need to be rebalanced in real time

With a stateful system like Ably, state also needs to be moved in real time between new and old nodes whilst providing continuity throughout.

If you have never worked with Gossip or consensus algorithms, then I recommend you read up on:

Gossip protocol
Paxos protocol
Raft consensus algorithm
Popular consensus backed systems like etcd and Zookeeper, and gossip backed systems like Serf

Eventually consistent data types and read and write consistencies

Generally in a distributed system, locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, availability can be prioritised and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.

If you’re not familiar with CRDTs or Operational Transform, or with the concept of variable consistency levels for reads and writes in a distributed data store, then you’ve got some reading to do:

Operational Transform — originally implemented by Google in their Wave product and now in Google Docs. It has uses in collaboration apps, but OTs are complex and not widely implemented.
Conflict-free Replicated Data Types (CRDTs) provide an eventually consistent result so long as the data types available are used. They are used by the Riak distributed database and by presence in Phoenix.
Consistency levels for both read and writes in distributed databases like Cassandra.
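To make the CRDT idea concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). The class and method names are our own for this example and are not taken from Riak or Phoenix. Each replica only ever increments its own slot, and replicas converge by merging with an element-wise maximum.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max() is commutative, associative and idempotent, so replicas can
        # exchange state in any order, any number of times, and still converge.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)
```

Two replicas that increment independently and then merge in either direction end up with the same value, with no lock and no coordination at write time.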

Deep understanding of network protocols

In a distributed system, you’ll almost certainly be working within all layers of the networking stack. Ably relies extensively on various higher level protocols such as HTTP, WebSockets, gRPC, and TCP sockets. But without a deep understanding of those protocols and the full stack of protocols they rely on all the way down to the OS itself, you’ll likely struggle to solve problems in a distributed system when things go wrong.

Take for example the following request or WebSocket connection which involves all of the following. At each layer you should be confident in your understanding and ability to debug problems at a packet or frame level:

DNS protocol and UDP for address lookup
File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables etc.
IP to route packets between hosts
TCP to establish a connection
TLS handshakes, termination and certificate authentication
HTTP/1.1 or more recently 2.0 used extensively by gRPC
WebSocket upgrades over HTTP
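As one small example of what frame-level understanding looks like at the top of that stack, the sketch below computes the Sec-WebSocket-Accept header a server must return during a WebSocket upgrade, as defined in RFC 6455. The function name is ours; the GUID constant and the sample key and result come from the RFC itself.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept value from the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample values from RFC 6455, section 1.3.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

If a proxy or load balancer in the path strips or mangles these headers, the upgrade fails, and knowing what the exchange should look like is what lets you spot the problem in a packet capture.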

And that’s not all…

From our perspective of operating a truly global and distributed system, a working understanding of the specific concepts described above is what we expect from a distributed systems engineer. Before that you need to also be a solid systems engineer. This requires you to have fundamentals in place such as programming languages, general design patterns, version control, infrastructure management, and continuous integration and deployment systems.

Further reading

About the author

I’m Matt, technical CEO and co-founder of Ably. I'm very interested in realtime problems, realtime engineering, and where the industry is headed. That’s the reason I co-founded Ably, which provides cloud infrastructure and APIs to help developers simplify complex realtime engineering. Organizations build with Ably because we make it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs.

We're hiring across our engineering and commercial teams, including Distributed Systems Engineering roles so check out our job board. Not looking for a role but know someone who is? Refer an engineer that we employ and we'll send you $3k as a thanks. One email = $3k.

And if you’re interested in chatting to me about realtime problems, distributed systems, or this article, please do reach out to me @mattheworiordan or @ablyrealtime on Twitter.

Solving the WebRTC signaling challenge

Michael Ade-Kunle — Mon, 25 Nov 2019 16:19:34 +0000

Even if you think you don’t know what WebRTC is, chances are you are pretty well-acquainted with it. Why? Because everyday web operations rely on it. The article below describes a common challenge developers encounter when employing WebRTC under the hood and how to solve it, with links to further information.

WebRTC is a realtime communication standard that is baked right into the web browser. It enables developers to build applications that allow things like voice or video calling as well as sending arbitrary data (which Google Stadia uses to control cloud games for example). If you ever did a voice or video call using Facebook Messenger or Google Duo/Meet/Hangouts - then you’ve experienced WebRTC already.

Interestingly though, I am not here to discuss what WebRTC is or has (you can find more information about that on my blog, BlogGeek.me), but rather what it lacks and how to solve it. WebRTC lacks signaling. By signaling I mean the ability to find the person you want to communicate with and negotiate the communication terms (is this a video session? Voice only? What codecs will be used? etc). WebRTC will do a fabulous job connecting a session and making sure audio and video are crisp to the level available by your network. But for that to happen your application first needs to have a signaling channel and protocol in use.

So if WebRTC lacks signaling, this is a part developers need to figure out on their own. The messages that WebRTC wants you to send on its behalf are a set of SDP blobs. WebRTC handles the creation and parsing of these SDPs but not the sending and receiving part.

You, as a developer, need to decide how to send them. Some use XMPP as their protocol of choice for such messages, or resort to MQTT. Others use SIP (which is quite common in VoIP). For the most part, though, I’d say that developers tend to invent and use their own proprietary protocol here and just use a WebSocket or a Comet-type solution like XHR.
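To show how little the signaling layer itself has to do, here is a minimal sketch of a relay that passes offer and answer messages between two peers. It is an in-memory stand-in for whatever transport you choose (WebSocket, XMPP, MQTT and so on); the class name, the envelope fields and the placeholder SDP strings are all assumptions made for this example, not part of WebRTC.

```python
import json
from collections import defaultdict, deque


class SignalingRelay:
    """Hypothetical relay: queues JSON envelopes per recipient."""

    def __init__(self):
        self._inboxes = defaultdict(deque)

    def send(self, sender: str, recipient: str, kind: str, sdp: str) -> None:
        if kind not in ("offer", "answer"):
            raise ValueError("kind must be 'offer' or 'answer'")
        # The relay never parses the SDP; WebRTC creates and consumes it.
        envelope = json.dumps({"from": sender, "type": kind, "sdp": sdp})
        self._inboxes[recipient].append(envelope)

    def receive(self, peer: str):
        """Return the next envelope for this peer, or None if there is none."""
        inbox = self._inboxes[peer]
        return json.loads(inbox.popleft()) if inbox else None


relay = SignalingRelay()
relay.send("alice", "bob", "offer", "v=0 (placeholder offer SDP)")
offer = relay.receive("bob")
relay.send("bob", offer["from"], "answer", "v=0 (placeholder answer SDP)")
```

The hard parts in production are the ones this sketch leaves out: finding peers, authenticating them, and keeping the relay available at scale.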

Many of the developers who implement WebRTC in their apps make two good friends along the road:

Node.js - seems like the winner in many signaling projects these days
GitHub - where code can be found

The challenge is that there’s no popular and proven GitHub project for WebRTC signaling. All of them require a lot of care and love to get them to production readiness.

Which is why there are developers who end up opting to not run their own signaling service, but rather “rent” one - from services like Ably Realtime.

Why would someone prefer using a third-party managed service for WebRTC signaling and not take the route of self development? For the same reasons you host machines on AWS and not open your own data center:

Someone else needs to take care of uptime, monitoring, security, updating and dealing with the nuances of supporting multiple browsers, operating systems and SDKs
That vendor is also responsible for scaling the service to meet your growing demands. This is doubly true in WebRTC where all of these messages are “stateful” - something that makes scaling even harder
You get to focus on what’s important to you - the messages and state machines that drive your application, and nothing more

Ably has put together a series of tutorials on how to implement WebRTC apps with Ably Realtime as the underlying signaling service - You can try it out yourself by following their simple steps.

The way I see it, there are 3 main ways to develop communication features with WebRTC these days:

DIY - by using GitHub and open source
Semi-managed - using a vendor to manage your signaling and another vendor to manage your NAT traversal
Fully managed - going and using a CPaaS vendor that has it all

Why the middle ground of semi-managed? Because it has less vendor lock-in characteristics to it and gives better flexibility in mixing and matching components that you may need. I’d especially suggest it for those who are considering the DIY route - because it will make their lives easier by reducing the non-functional aspects of development needed, while still letting them maintain a large bulk of their IP.

What’s your preferred signaling method for WebRTC? More information about WebRTC signaling servers is available on BlogGeek.me. Jump in to Ably WebRTC signaling solutions by browsing Ably docs or experimenting with a free account. If you have any particularly good solutions to this issue that you feel would benefit this article, get in touch with Data in Motion editors.

Blog by Tsahi Levent-Levi, Author of BlogGeek.me as well as CEO & Co-founder at testRTC. He also has online courses (free and paid) at webrtccourse.com

DNS issues? Five practical strategies to remove single points of failure

Michael Ade-Kunle — Tue, 19 Nov 2019 11:22:40 +0000

June/July 2019’s Cloudflare incidents got the world thinking about additional safeguards against ‘unlikely’ DNS failure. The article below sheds light on five strategies for coping with these unlikely - but possible - DNS failures, as well as general advice for service reliability.

You may think a domain name system (DNS) outage is a rare possibility for your company. However, recent statistics paint a different picture. For example, a survey published in October 2018 found that 68% of the top 50 Fortune 500 companies used only a single provider to serve their DNS records.

The research concluded that nearly three-quarters (72%) of companies polled were vulnerable to DNS attacks. Besides businesses using only one provider, some only have DNS servers on an internal network.

Political issues can also cause DNS problems. This happened in 2004 with the national domain in Slovakia, which got compromised.

The article below describes five things you can do to get rid of single points of failure associated with a DNS.

Strategy 1: Understand root server issues

While root server issues are not really possible to resolve, the likelihood of them occurring is incredibly low. Knowing what could go wrong aids decision-making.

The first thing we need to understand about DNS is that, at the core, there are 13 root name servers that are ultimately responsible for delegation of every single domain. If these go down, once the time-to-live (TTL) dates for domains expire, the domain name system as we know it is down. The TTL is like an expiration date that tells a local resolver or recursive server how long to keep a DNS record in its cache.
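The TTL behaviour described above can be sketched as a tiny resolver-side cache. This is an illustration only: the class name and the injectable clock are assumptions for this example, and a real resolver does far more (negative caching, prefetching and so on).

```python
import time


class TtlCache:
    """Illustrative DNS-style cache: a record is served until its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._records = {}  # name -> (address, expiry time)

    def put(self, name: str, address: str, ttl_seconds: float) -> None:
        self._records[name] = (address, self._clock() + ttl_seconds)

    def get(self, name: str):
        """Return the cached address, or None once the TTL has expired."""
        entry = self._records.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the resolver must go back upstream for a fresh answer.
            del self._records[name]
            return None
        return address
```

While a record is still within its TTL, an upstream outage is invisible to clients served from the cache; once it expires, the next lookup depends on the upstream servers being reachable.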

The Internet Corporation for Assigned Names and Numbers (ICANN) is the group that oversees the interconnected link of unique identifiers that lets all internet-connected computers find each other. However, as you can see from the list below, it’s responsible for providing service for only one of these 13 root name servers. Otherwise, it has delegated resolution to 12 independent organizations spread around the globe.

List of Root Servers

Source https://www.iana.org/domains/root/servers, visually displayed at https://root-servers.org/

While the risk of the entire domain name system being unavailable is low, it is not impossible due to the following reasons:

i) Governmental control

Theoretically, this has no longer been possible since 2016, when the US handed over control. However, given that 10 out of 13 root servers are in the US, it’s still plausible. Only Netnod (Swedish), RIPE (European), and WIDE (Japanese) are domiciled outside the US.

ii) DDoS attacks

Given the scale of commercial organisations such as Verisign and Cogent, and the fact they have American government backing, this seems unlikely to be achievable without tremendous cost. All root server IP addresses use Anycast, which means each IP may represent any number of server endpoints. This allows traffic to be serviced close to the request, and also allows attack traffic to be spread across multiple servers. In 2007, when Anycast was only partially implemented on root servers, servers running with Anycast saved the day.

iii) Spoof / IP hijacking

This is hard to do at scale, easier in smaller closed networks. In these cases, such problems are not feasibly resolvable at an organizational level. Similarly, the likelihood of them happening is far lower than other possibilities.

Given the above, root-level resolution of a gTLD (generic) domain such as .com or .info, or of a ccTLD (country, two-letter code) domain such as .uk or .jp, to an authoritative DNS server is pretty much guaranteed.

Apart from closed network attacks, having any strategy to circumvent an attack on the global root servers is probably not worth it, as all the other services a user/server relies on will probably be unavailable anyway, i.e., the end-to-end service will still be down even if your service is somehow still up.

To summarize Strategy 1: The best approach to take in this case is to be aware of the root server issues that are possible. Outside of the broad categories for such attacks, stay mindful of specific trends that may pose higher-than-average risks for this kind of failure.

Strategy 2: Know your issues with TLD (top-level domain) authoritative servers

The top-level domain is the part of an internet address that appears after the final period in the address, such as .com, .gov or .edu. In terms of availability, not all domains are equal: the Chinese top-level domain went down in Aug 2013. The key thing to bear in mind is that the smaller the domain, the less likely it is to have infrastructure robust enough to defend against attacks.

Conversely, Verisign operates .com and has clearly invested heavily in its Internet service infrastructure, both for the two root servers it runs (A and J) and for .com, .name, and .net.

Once again, you can’t really do much about a TLD going down, but you can choose a TLD that is more likely to remain up under a large-scale attack, or even as a result of a software fault.

In this case, you can and should put your resources toward researching which TLDs are most likely to go down versus which ones are most reliable. The .io domain has a few horror stories about going down, being compromised etc. The domain is also controversial politically. In fact, researching this article has led us to purchase the ably.com and ably.net domains, and we’ll be migrating over domains from *.ably.io to these domains in due course.

In general terms, it's worth remembering that you are in control of your DNS, and decisions you make may affect reliability and availability of your domain. Historically, companies ran their own authoritative name servers. Given reliability issues of running DNS servers yourself, it was common for companies to run a primary name server within their network, and run a secondary for another business. That business, in turn, would run your secondary. Once cloud infrastructure emerged, DNS-as-a-service provided DNS resolution and authoritative name servers.

By using Anycast, DNS providers were able to dramatically improve the reliability of the authoritative name servers by resolving DNS queries close to the originating DNS client and also resolving a few authoritative name server IPs to numerous backend servers. Anycast, used for both root and authoritative servers, delivers both performance and reliability if implemented correctly.

Choosing a TLD that is unlikely to change hands and is powered by some of the biggest domain infrastructure providers in the world would be a wise safeguard. Moreover, you have the option of using two TLDs as another precautionary measure against this kind of failure.

To sum up Strategy 2: Consider that the likelihood of this kind of DNS failure is incredibly low as long as you pick a reliable TLD. If you don’t take the time to learn about the options and go with trustworthy TLDs, the chances of an outage go up.

Strategy 3: Understand name server issues

Here, we’ll take a closer look at how your DNS management choices can make a difference regarding your likelihood of DNS failure. Being mindful of some characteristics can help you make smarter DNS provider choices.

First, your DNS provider must use Anycast. It allows routers to choose the desired path based on several factors. They include the least-congested route, the closest one or the option with the least latency. Anycast speeds up the DNS resolution process for users. Besides using Anycast, your DNS provider must operate at tremendous scale. Then, it can manage increased web traffic as your internet presence grows.

Secondly, your DNS provider should not be the same company that services your endpoints. Cloudflare’s incident showed us what can happen with that setup. Plus, aim to use a multi-DNS provider with APIs to keep the two synchronized. If you’re an endpoint provider that offers client software, your goal is to have multiple endpoints on different domains.

Finally, don’t overlook the need to regularly check for upcoming domain expiration or SSL certificates going out of date.

When you outsource your DNS management to a provider that uses Anycast, and ideally work with multiple providers, you are more likely to prevent failures. Remember that the likelihood of them happening depends on which choices you make. The suggestions above should help you avoid costly mistakes.

Strategy 4: Take a preventive approach to avoid renewal issues

If your domain name expires, it could open an assortment of other problems that cause headaches and wasted time. Fortunately, there is normally a grace period of approximately a month during which you can still proceed with the renewal.

However, if you don’t, another person could buy the domain once it expires. After all, you no longer own it then, so it’s up for grabs. That could mean the website people have associated with your internet destination for years is no longer under your control. Thus, all the time and money spent building your brand and gaining name recognition in the marketplace become useless.

If your SSL certificate expires, it could make your website look illegitimate. For example, if a person using the Google Chrome browser navigates to a website with an expired SSL certificate, they see a warning that says it’s not advisable to continue to the site.

Google advises the internet user to go “back to safety.” However, they can override that choice and continue to look at the website by selecting an advanced option provided on the same page. Many people understandably get scared off and decide not to proceed, especially if the site collects payment information or other sensitive details.

It should be clear why it’s necessary to devote significant resources to stopping these kinds of renewal issues from happening. Fortunately, the biggest thing you must make available is commitment rather than time. That’s because there are handy domain name monitoring services that let you track expiration dates and prevent hijacking.

Expirations represent a kind of DNS failure that’s wholly preventable, provided you stay on top of the relevant dates and decide to renew them in time. On the other hand, the risk is substantial if you don’t take expiration dates and renewal timeframes seriously.

In addition to the expirations of your externally facing domains and security certificates — the ones members of the public see — don’t forget to keep tabs on any internal domains and certificates. Your company may have some of those for its intranet.

Strategy 4 in summary: Keep track of domain name expiry dates.
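Tracking certificate expiry in particular is easy to automate. The sketch below is a minimal example, assuming you already have the certificate's notAfter string (for instance from the dictionary returned by Python's ssl.SSLSocket.getpeercert()). The function names and the 30-day threshold are our own choices for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days left before a certificate's notAfter time (negative if expired)."""
    # ssl.cert_time_to_seconds parses strings like "Jan  5 09:34:43 2018 GMT".
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """True when the certificate expires within the (assumed) alert threshold."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Running a check like this on a schedule, against both public and internal certificates, turns an expiry into an alert weeks in advance rather than an outage.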

Strategy 5: Safeguard your ecosystem

We’ve looked at some specific things that can be done to reduce failure points with a DNS. In this final main section, let’s explore some of the reliable things you can do to safeguard your ecosystem.

As a start, use Anycast for everything. We’ve already gone over how and why Anycast shortens the DNS resolution time. It also provides a better experience for end-users by reducing bottlenecks.

Next, remember that coupling your endpoints and DNS zone control in one provider is not a good idea. Yes, Cloudflare, we’re looking at you and will learn lessons from the associated incidents.

If you must use Cloudflare and their DNS, then use their CNAME setup approach. Note that this has the downside of one additional DNS lookup, but that’s arguably a fair compromise.

You may be wondering about using a DNS service that manages it through one or more providers. It’s difficult to give a clear-cut answer regarding that option that applies to every situation. That’s because any such service would come with numerous features. You then have a tradeoff where you decide whether the features are worth the possible risks of a company that lets outside providers assume management responsibilities.

The alternative is an Anycast provider that routes DNS to one of two providers. The issue with that, though, is that it introduces a new single point of failure...

Safeguarding your ecosystem should be an ongoing concern instead of something you prioritize for a while and then forget about soon afterward. When you take these suggestions to heart, you’re making meaningful progress in reducing your risk of experiencing single points of failure with your DNS.

DNS crisis prevention: Three ideas for service providers

The information above relates to consumer sites. However, it’s not necessarily applicable to service providers. If you’re in that category, here are some specific tips.

1) Use multiple DNS endpoints with failover strategies in your clients. Complexity is high, as you need to deal with additional logic in all your client software to handle failures and know what failures require routing to alternative endpoints. Plus, other endpoints need to be hardcoded, and clients need a retry and backoff strategy.

This is the approach we have taken at Ably for Ably SDKs, but it does not work for open protocols. Stream.io had a similar issue and came to the same conclusion.

Stream originally mentioned it would implement client-side failover, which would mean clients would skip failing servers. However, according to the data from GitHub, it looks like it decided against that approach after all.

2) Having a third party manage failover DNS is not really a viable solution in all cases. How confident are you that the system that controls the failover will be available and operational at the time it’s needed?

3) You should also think about offering your service without a DNS lookup at all, and instead use Anycast to route traffic by IP to the closest healthy server. You will almost certainly end up with issues with devices that connect over TLS and don’t support custom server name.
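The client-side failover described in point 1 can be sketched roughly as follows. This is not the Ably SDK implementation: the function name, the parameters, and the choice of exponential backoff between rounds are assumptions made for this example, and the connect callable is supplied by the caller.

```python
import time


def connect_with_fallback(endpoints, connect, rounds=3, base_delay=0.5, sleep=time.sleep):
    """Try each endpoint in order; back off exponentially between full rounds."""
    last_error = None
    for round_number in range(rounds):
        for endpoint in endpoints:
            try:
                return connect(endpoint)
            except OSError as error:
                # DNS failures (socket.gaierror), refused connections and
                # timeouts are all OSError subclasses in Python 3.
                last_error = error
        if round_number < rounds - 1:
            sleep(base_delay * (2 ** round_number))
    raise ConnectionError(f"all endpoints failed after {rounds} rounds") from last_error
```

The endpoint list would hold the primary host first, followed by hardcoded fallbacks on different domains, so that a failure of one domain or DNS provider does not take every route down with it. Deciding which errors justify moving to a fallback is where most of the real complexity lives.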

Conclusion

You now know that single points of failure with your DNS are possible, but also largely preventable. Doing thorough research about your domain name options and providers is an excellent early step to take as you determine which companies are best equipped to meet your needs. Then you just need to devote resources to consistently minimising your risk, as suggested in the list above.

If you’re interested in learning more about any of the topics mentioned here, browse the section below.

The above article was written based on Ably Realtime's experience providing cloud infrastructure and APIs that help developers simplify complex realtime engineering. Ably makes it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs. To get started create a free Ably account or talk to sales.

Further Reading
Resources about major domains or provider issues:

A list encompassing literally hundreds of issues (Ianix)
Cloudflare's detailed post-mortem of the 2019 incidents
A topline overview of the .de domain outage (Sans Technology Institute)
Details of the .st domain outage (Blorn.com)
Some coverage of the puri.sm domain outage that occurred in 2018 (from puri.sm)
A UK tech blogger describes his experience with a .io outage (haydenjames.io)
Overview of a DNS outage that affected Microsoft Azure in May 2019 due to a migration mistake (Build Azure)
Details about another Microsoft issue that resulted in the deletion of Microsoft Azure server records
Information on The Dyn Attack - probably the biggest in history in terms of impact (Wikipedia as starting point)
Two articles about safeguards through a multi-DNS strategy: perspectives from GlobalDots and InfoQ
A perspective on why using two DNS providers could help with surviving an attack, and why it’s not a widely adopted practice - yet (Internet Society)
Some generally good advice about DNS tips to avoid pitfalls (although the one about increasing the TTL length is not ideal) (Canopy.co blogpost)

Other reading:

Most companies still vulnerable to DNS attacks - a view expressed in articles in Silicon Republic and The Register. Interestingly, 44% of SaaS platforms still use one DNS provider. Looking at the BBC for example, all DNS servers are in their own network, HubSpot uses Cloudflare exclusively (two DNS endpoints), Salesforce has their own DNS, which routes to Dyn and Neustar, and Amazon apparently uses Dyn and UltraDNS.
A good source of DNS stats. What’s interesting of course is that the number of reported issues with TLDs is disproportionately skewed toward the tiny TLDs, i.e. anecdotally, the smaller the TLD, the more likely you will be affected.
Details on Anycast roll out with RIPE and impact (Ripe.net)

SQS FIFO Queues: Message Ordering and Exactly-Once Processing Guaranteed?

Michael Ade-Kunle — Fri, 15 Nov 2019 16:31:48 +0000

Converting a distributed system to FIFO and exactly-once message processing requires considerable user effort. This article looks at why this is the case, serving as a guide about what to bear in mind when implementing a FIFO and exactly-once message processing scenario using SQS FIFO queues.

Between the lines of SQS FIFO Queues

Reading Amazon's marketing around SQS FIFO queues, we're led to believe this provides exactly once and FIFO order processing as an out-of-the-box solution. Correspondingly one could easily believe that replacing any SQS queues in an existing distributed system with SQS FIFO queues would be the only change needed to convert the system to FIFO and exactly-once processing. It would indeed be really nice to believe Amazon engineers had done all the heavy lifting and this difficult functionality was wholly available at the flip of a switch.

The reality is somewhat more complicated. While Amazon delivers their side of the deal, it seems pretty impossible for Amazon to complete exactly what it says on the tin without user input. What does this user input entail?

What is FIFO Message Ordering?

First it's useful to break down the function of FIFO order processing. A FIFO SQS queue guarantees messages will be served in the same order in which they were received. (Alternatively, it could group messages according to their Group ID field and guarantee ordered delivery within the group, but not between different groups. Each group of messages could then be treated as a FIFO queue in itself for the purposes of the analysis that follows.) That, however, does not guarantee a FIFO delivery for the distributed system using the FIFO SQS queue, if we define the system to encompass the producer/sender(s) as well as the consumer/receiver(s) of the messages. For example, in the simple case of a single producer and a single consumer the system would look like:

What nuances should we know about?

It is indeed guaranteed that a message that gets in the SQS FIFO queue will not get served while there are older messages still in the queue. That, however, does not mean that the consumer/receiver will get the messages in the same order the producer/sender submitted them to the SQS FIFO queue. For this to be guaranteed two additional assumptions need to be satisfied:

The queue must get messages in the same order they were produced by the producer/sender
The consumer/receiver must get messages in the same order they come out of the SQS FIFO queue

In order for these assumptions to be satisfied we need a single, synchronous producer/sender and a single, synchronous consumer/receiver transmitting data over a synchronous transport layer.

Adding asynchronicity to any part of the system removes the ordering guarantee for that part. The result is the ordering guarantee for the entire system is then removed as well. Looking into this in more detail - it is fairly obvious how an asynchronous transport layer could break message ordering. It is also to a large degree obvious that an asynchronous sender cannot guarantee the order of messages sent. An asynchronous receiver/consumer could break the FIFO order of messages under the assumption that it needs to complete processing each of them in FIFO order. Last but not least, having multiple senders or receivers virtually amounts to having single-instance, asynchronous ones, i.e. it would also break the system-wide FIFO order of messages!

Exactly-Once Processing - Guaranteed?

Exactly-once processing would mean that any duplicated messages would not affect the state of the system, i.e. they will be recognised as duplicates and ignored. A SQS FIFO queue would require a unique message deduplication ID (or it would create one by hashing the message contents) together with each message and use that ID to discount duplicate messages. At first glance it seems that using an SQS FIFO queue alone as opposed to a non-FIFO one would be enough for the entire system to attain the exactly-once processing property. Taking a deeper look, however, some subtle caveats emerge meaning exactly-once processing is not guaranteed all the time, under any conditions.

There are two limits introduced by the SQS FIFO queue implementation that make it impossible to guarantee exactly-once processing in all possible cases.

The first one is the max 5 minutes timeout on storing a given message deduplication ID. So if a duplicated message arrives more than 5 minutes after the original, the SQS FIFO queue would accept it and continue to process it. If exactly-once were a mission-critical requirement, other parts of the distributed system would have to take care of guaranteeing message uniqueness.
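The effect of the 5-minute limit is easy to see in a small model. The sketch below is our own simulation of a deduplication window, not Amazon's implementation and not the SQS API; the class name is hypothetical and SHA-256 is used here simply as the content hash when no deduplication ID is supplied.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute limit discussed above


class DedupWindow:
    """Toy model of a queue that drops duplicates seen within a time window."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS):
        self.window = window
        self._seen = {}  # deduplication id -> time the message was accepted

    def accept(self, body: str, now: float, dedup_id: str = None) -> bool:
        """Return True if the message is enqueued, False if dropped as a duplicate."""
        if dedup_id is None:
            # Fall back to hashing the message contents.
            dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        accepted_at = self._seen.get(dedup_id)
        if accepted_at is not None and now - accepted_at < self.window:
            return False
        self._seen[dedup_id] = now
        return True
```

A retry that arrives 299 seconds after the original is dropped, while the same retry arriving at 300 seconds is accepted as a new message, which is exactly the gap the rest of the system has to cover.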

The second limit of the SQS FIFO queue implementation comes from the kind of “exactly-once” guarantee it gives to the consumer/receiver. It does not guarantee exactly-once delivery but it does guarantee exactly-once processing. What this means is that the SQS FIFO queue has to receive acknowledgement from the consumer that a message is processed before it stops returning it on future message requests. In other words the message might actually get delivered more than once. The way this works in SQS FIFO queue implementation is by keeping the message inaccessible to other consumers for a certain period of time after it has been delivered to a consumer. This gives the consumer who received it the chance to process it and delete it from the queue. (Deleting the message from the queue serves as acknowledgement the message has been processed.) This period of time, known as visibility timeout, can last for up to 12 hours. The assumption is that this period will be long enough to allow for receiving and processing the message and sending a delete request back to the queue. This, however, is just an assumption and it might not hold in all cases. So again, if exactly-once processing is mission-critical across the system, the fact you’re using SQS FIFO queue is not itself a guarantee. For exactly-once processing to come as guaranteed, the consumer of the messages will have to run its own independent bookkeeping.
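The consumer-side bookkeeping mentioned above can be as simple as remembering which message IDs have already been processed. The sketch below is a minimal in-memory version; the class name is hypothetical, and a real consumer would need to persist the processed set and commit it atomically with the side effect, which this example does not attempt.

```python
class IdempotentConsumer:
    """Illustrative consumer that ignores redelivered messages."""

    def __init__(self, handler):
        self._handler = handler
        self._processed = set()  # ids of messages already handled

    def handle(self, message_id: str, body: str) -> bool:
        """Process the message once; return False for a redelivery."""
        if message_id in self._processed:
            return False
        self._handler(body)
        # Recorded only after the handler succeeds, so a crash mid-processing
        # leads to a retry rather than a silently lost message.
        self._processed.add(message_id)
        return True
```

With this in place, a message that is redelivered because the visibility timeout expired before the delete request arrived is recognised and skipped instead of being applied twice.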

Combining exactly-once processing with FIFO ordering can also hurt the performance of SQS FIFO queues. To guarantee FIFO ordering, no subsequent message will be served until the previous one has been acknowledged as processed or its visibility timeout expires. (An SQS FIFO queue will deliver a group of messages instead of a single one, but that has only a minimal impact: the next group will not be delivered until the queue knows that the last one delivered has been processed by the consumer.) It’s easy to see how system throughput would rapidly deteriorate in scenarios with subpar connection quality and large visibility timeouts.

Theoretical Background

It is evident that SQS FIFO queues guarantee in-order and exactly-once processing only within the queue itself, not in the entire distributed system that would use the queue. An SQS FIFO queue is simply a more convenient building block to use in a distributed system aiming at system-wide FIFO message delivery and exactly-once processing and nothing more than that.

Taking a step back, the field of distributed computing has a well-known problem called Total Order Broadcast. Total Order Broadcast specifies that messages may be delivered to all participants in any order, so long as it is the same order for everybody. FIFO delivery of messages in a distributed system is actually a special case of this problem, with an additional restriction imposed on the order. Research on the Total Order Broadcast problem aids our understanding of the limitations of possible solutions to the FIFO delivery problem.
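The relationship between the two properties can be made concrete with a small checker. This is a simplified sketch: it assumes every participant has delivered the same set of messages, represents each delivery log as a list of hypothetical `(sender, sequence_number)` pairs, and the function names are introduced here purely for illustration.

```python
def is_total_order(logs):
    """True if every participant delivered the messages in the same order."""
    logs = list(logs)
    if not logs:
        return True
    return all(log == logs[0] for log in logs[1:])


def is_fifo(log):
    """True if each sender's messages appear in the order they were sent."""
    last_seq = {}
    for sender, seq in log:
        if seq <= last_seq.get(sender, -1):
            return False  # an earlier message from this sender arrived later
        last_seq[sender] = seq
    return True


# Everyone agrees on the order, but sender "a" is seen out of order:
# total order holds, FIFO does not.
agreed_but_reordered = [
    [("a", 1), ("a", 0)],
    [("a", 1), ("a", 0)],
]

# Each log respects per-sender order, but the participants disagree:
# FIFO holds for each log, total order does not.
fifo_but_divergent = [
    [("a", 0), ("b", 0)],
    [("b", 0), ("a", 0)],
]
```

System-wide FIFO delivery in the sense discussed here needs both checks to pass: all participants agree on one order, and that order respects the order in which each sender published.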

Simply put, the Total Order Broadcast problem is equivalent to the problem of achieving distributed consensus, i.e. having all participants in a distributed system agree on message delivery order. The distributed consensus problem has been proven impossible to solve in the general case (an asynchronous system in which even a single process may fail), as described in a groundbreaking paper known as “the FLP result”.

As a consequence, the existing distributed consensus / total order broadcast algorithms that guarantee successful completion in all possible cases impose some restrictions on the distributed system they run on. This in turn limits the number of cases in which they can be applied. Lifting the restrictions means the algorithm will fail in some cases (which does not necessarily render it unusable in practice). Not surprisingly, the theory is borne out by our discussion so far. To guarantee FIFO/exactly-once delivery, we kept introducing various restrictions on the distributed system, e.g. a synchronous transport layer, transactions on the consumer side, etc.

Conclusion

To conclude, the problem of FIFO/exactly-once delivery in a distributed system, which we hoped might be solved using SQS FIFO queues alone, turns out to be much harder than simple, marketing-oriented definitions would suggest.

While an SQS FIFO queue is certainly a very useful tool in building a solution to this problem, it does not constitute a solution by itself. Guaranteeing exactly-once publishing outside the queue is a problem that frequently falls to distributed system developers to solve.

For further reading on the problems typically encountered when guarding against duplication across distributed systems, and how we solved them, see our Deep Dive into Implementing Idempotency across distributed systems.