Forem: Chris McFadden

RESTful API Versioning Best Practices: Why v1 is #1

Chris McFadden — Tue, 19 Sep 2017 18:06:37 +0000

Breaking Changes Bad! API Versioning Good!

As anyone who has built or regularly uses an API realizes sooner or later, breaking changes are very bad and can be a serious blemish on an otherwise useful API. A breaking change is a change to the behavior of an API that can break a user’s integration, causing a lot of frustration and a loss of trust between the API provider and the user. Unlike a delightful new feature, which can simply show up, a breaking change requires notifying users in advance (with accompanying mea culpas). The way to avoid that frustration is to version an API, with assurances from the API owner that no surprising changes will be introduced within any single version.

So how hard can it be to version an API? The truth is it’s not hard. What is hard is maintaining some sanity rather than needlessly devolving into a dizzying number of versions and subversions applied across dozens of API endpoints with unclear compatibilities.

We introduced v1 of the API three years ago and did not realize that it would be going strong to this day. So how have we continued to provide the best email delivery API for over two years but still maintain the same API version? While there are many different opinions on how to version REST APIs, I hope that the story of our humble yet powerful v1 might guide you on your way to API versioning enlightenment.

REST Is Best

The SparkPost API originates from when we were Message Systems, before our adventures in the cloud. At the time we were busy making final preparations for the beta launch of Momentum 4. This was a major upgrade to version 3.x, our market-leading on-premises MTA. Momentum 4 included an entirely new UI, real-time analytics, and most importantly a new web API for message injection and generation, managing templates, and getting email metrics. Our vision was of an API-first architecture – where even the UI would interact with API endpoints.

One of the earliest and best decisions we made was to adopt a RESTful style. Since the late 2000s, representational state transfer (REST) based web APIs have been the de facto standard for cloud APIs. Using HTTP and JSON makes it easy for developers, regardless of which programming language they use – whether PHP, Ruby, or Java – to integrate with our API without knowing or caring about our underlying technology.

Choosing to use the RESTful architecture was easy. Choosing a versioning convention was not so easy. Initially we punted on the question of versioning by not versioning the beta at all. However, within a couple months the beta was in the hands of a few customers and we began building out our cloud service. Time to version. We evaluated two versioning conventions. The first was to put the versioning directly in the URI and the second was to use an Accept header. The first option is more explicit and less complicated, which is easier for developers. Since we love developers, it was the logical choice.
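
To make the two conventions concrete, here is a minimal sketch of how a server might read the version under each one. The `/api/v1/` path prefix and the `application/vnd.example.v1+json` media type are assumptions chosen for illustration, not a description of how the SparkPost API is implemented.

```python
import re

# Hypothetical path prefix and media type, used only to illustrate the two conventions.
PATH_PATTERN = re.compile(r"^/api/v(\d+)/")
ACCEPT_PATTERN = re.compile(r"application/vnd\.example\.v(\d+)\+json")


def version_from_uri(path):
    """Option 1: the version is part of the URI, e.g. /api/v1/templates."""
    match = PATH_PATTERN.match(path)
    return int(match.group(1)) if match else None


def version_from_accept(headers):
    """Option 2: the version is carried in the Accept header."""
    match = ACCEPT_PATTERN.search(headers.get("Accept", ""))
    return int(match.group(1)) if match else None
```

With the URI convention the version is visible in every request a developer types, copies, or logs, which is a large part of why it is the more explicit option.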

API Governance

With a versioning convention selected we had more questions. When would we bump the version? What is a breaking change? Would we reversion the whole API or just certain endpoints? At SparkPost, we have multiple teams working on different parts of our API. Within those teams, people work on different endpoints at different times. Therefore, it’s very important that our API is consistent in the use of conventions. This was bigger than versioning.

We established a governance group including engineers representing each team, a member of the Product Management team, and our CTO. This group is responsible for establishing, documenting, and enforcing our API conventions across all teams. An API governance Slack channel also comes in handy for lively debates on the topic.

The governance group identified a number of ways changes can be introduced to the API that are beneficial to the user and do not constitute a breaking change. These include:

A new resource or API endpoint
A new optional parameter
A change to a non-public API endpoint
A new optional key in the JSON POST body
A new key returned in the JSON response body
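
These additions only stay harmless if clients read the keys they know about and ignore the rest. The sketch below shows that kind of tolerant client; the response fields (`id`, `name`, `last_used`) are hypothetical and chosen only for illustration.

```python
import json


def parse_template(body):
    """Read only the keys this client knows about and ignore anything else."""
    data = json.loads(body)
    return {"id": data["id"], "name": data.get("name", "")}


# The same hypothetical response before and after a new key is added.
old_response = '{"id": "welcome", "name": "Welcome email"}'
new_response = '{"id": "welcome", "name": "Welcome email", "last_used": "2017-09-01"}'

# A tolerant client produces the same result for both responses.
same_result = parse_template(old_response) == parse_template(new_response)
```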

Conversely, a breaking change includes anything that could break a user’s integration, such as:

A new required parameter
A new required key in POST bodies
Removal of an existing endpoint
Removal of an existing endpoint request method
A materially different internal behavior of an API call – such as a change to the default behavior
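
The two lists above can be sketched as a small classifier that a review checklist or tool might apply. The `Change` type and the `kind` names are hypothetical, not part of any SparkPost tooling, and the rule for unknown kinds is an assumption on the cautious side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str               # hypothetical labels such as "add_parameter"
    required: bool = False  # only meaningful for added parameters or body keys
    public: bool = True     # False for non-public endpoints


ALWAYS_SAFE = {"add_endpoint", "add_response_key"}
SAFE_IF_OPTIONAL = {"add_parameter", "add_request_key"}
ALWAYS_BREAKING = {"remove_endpoint", "remove_method", "change_default"}


def is_breaking(change):
    """Apply the governance rules from the two lists above."""
    if not change.public:
        return False  # changes to non-public endpoints are not breaking
    if change.kind in ALWAYS_SAFE:
        return False
    if change.kind in SAFE_IF_OPTIONAL:
        return change.required  # a new required input breaks existing callers
    if change.kind in ALWAYS_BREAKING:
        return True
    return True  # assumption: treat anything unrecognized as breaking
```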

The Big 1.0

As we documented and discussed these conventions, we also came to the conclusion that it was in everyone’s (including ours!) best interest to avoid making breaking changes to the API since managing multiple versions adds quite a bit of overhead. We decided that there were a few things we should fix with our API before committing to “v1”.

Sending a simple email required way too much effort. To “keep the simple things simple” we updated the POST body to ensure that both simple and complex use cases are accommodated. The new format was more future-proof as well. Secondly we addressed a problem with the Metrics endpoint. This endpoint used a “group_by” parameter that would change the format of the GET response body such that the first key would be the value of the group by parameter. That did not seem very RESTful so we broke each group by into a separate endpoint. Finally we audited each endpoint and made minor changes here and there to ensure they conformed with the standards.
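
The Metrics problem is easier to see side by side. In this sketch the sample rows, the function names, and the `results` key are all assumptions for illustration; only the idea that the top-level key used to follow the `group_by` value comes from the description above.

```python
# Hypothetical sample data; field names are illustrative only.
ROWS = [
    {"domain": "example.com", "campaign": "welcome", "count_sent": 10},
    {"domain": "example.org", "campaign": "welcome", "count_sent": 5},
]


def _totals(field):
    """Sum count_sent per distinct value of the given field."""
    totals = {}
    for row in ROWS:
        totals[row[field]] = totals.get(row[field], 0) + row["count_sent"]
    return [{field: key, "count_sent": value} for key, value in totals.items()]


def metrics_grouped(group_by):
    """Old style: the top-level key changes with the query parameter."""
    return {group_by: _totals(group_by)}


def metrics_by_domain():
    """New style: one endpoint per grouping, with a stable top-level key."""
    return {"results": _totals("domain")}
```

A client of the old style has to know which parameter it sent before it can read the response, while a client of the new style always looks under the same key.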

Accurate Documentation

It is important to have accurate and usable API documentation to avoid breaking changes, whether deliberate or unintentional. We decided to use a simple API documentation approach leveraging a Markdown-based language called API Blueprint and to manage our docs in GitHub. Our community contributes to and improves upon these open source docs. We also maintain a nonpublic set of docs in GitHub for internal APIs and endpoints.

Initially, we published our docs to Apiary, a great tool for prototyping and publishing API docs. However, embedding Apiary into our website doesn’t work on mobile devices, so we now use Jekyll to generate static docs instead. Our latest SparkPost API docs now load quickly and work well on mobile devices, which is important for developers who are not always sitting at their computer.

Separating Deployment from Release

We learned early on the valuable trick of separating a deployment from a release. This way we can deploy changes frequently, as soon as they are ready, through continuous delivery and deployment, without having to publicly announce or document them at the same time. It’s not uncommon for us to deploy a new API endpoint or an enhancement to an existing API endpoint and use it from within the UI or with internal tools before we publicly document and support it. That way we can make some tweaks to it for usability or conformance to standards without worrying about making a dreaded breaking change. Once we are happy with the change we add it to our public documentation.
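
One way to picture the distinction is an endpoint registry that tracks "deployed" and "released" as separate flags. This is a minimal sketch under that assumption; the paths, flag names, and helper functions are hypothetical and do not describe SparkPost's actual tooling.

```python
# Hypothetical endpoint registry; the paths and flags are illustrative only.
ENDPOINTS = {
    "/api/v1/templates": {"deployed": True, "released": True},
    "/api/v1/previews": {"deployed": True, "released": False},
    "/api/v1/drafts": {"deployed": False, "released": False},
}


def can_serve(path):
    """Deployed endpoints answer requests, whether or not they are documented."""
    entry = ENDPOINTS.get(path)
    return bool(entry and entry["deployed"])


def public_docs():
    """Only released endpoints appear in the public documentation."""
    return sorted(path for path, entry in ENDPOINTS.items() if entry["released"])


def safe_to_tweak(path):
    """Until an endpoint is released, changing it breaks no public promise."""
    entry = ENDPOINTS.get(path)
    return bool(entry) and not entry["released"]
```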

Doh!

It is only fair to admit that there have been times where we have not lived up to our “no breaking changes” ideals and these are worth learning from. On one occasion we decided it would be better for users if a certain property defaulted to true instead of false. After we deployed the change we received several complaints from users since the behavior had changed unexpectedly. We reverted the change and added an account level setting – a much more user friendly approach for sure.
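
The account-level setting can be sketched as a simple precedence rule. The names below are hypothetical and the property itself is left unnamed, as it is above; the point is only that the API-wide default stays where it was, so existing integrations see no change.

```python
API_DEFAULT = False  # the original API-wide default, left unchanged


def effective_value(request_value=None, account_default=None):
    """An explicit request value wins, then the account setting, then the API default."""
    if request_value is not None:
        return request_value
    if account_default is not None:
        return account_default
    return API_DEFAULT
```

An account that wants the new behavior opts in once with `account_default=True`, and any single request can still override it explicitly.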

Occasionally we are tempted to introduce breaking changes as the result of bug fixes. However, we decided to leave these idiosyncrasies alone rather than risk breaking customers’ integrations for the sake of consistency.

There are rare cases where we made the serious decision to make a breaking change – such as deprecating an API resource or method – in the interest of the greater user community and only after confirming that there is little to no impact to users. For example, we deliberately made the choice to alter the response behavior of the Suppression API but only after carefully weighing the benefits and impacts to the community and carefully communicating the change to our users. However, we would never introduce a change that has a remote possibility of directly impacting the sending of a user’s production email.

Introducing v2 – Not Yet!

Who knows how long v1 of the SparkPost API will continue to reign? It has certainly outlived my own expectations. What will be the driving reason to release v2 of the API? I would love to hear your thoughts on this topic. If you have any questions about our API versioning or would like advice on how best to solve your own API versioning challenges, please don’t hesitate to connect on Twitter. And if you like building awesome APIs you are in luck, because we are always hiring. To learn more about how we built our API, check out How SparkPost Built the Best Email API for Developers.

This post was originally published on the SparkPost blog.

Operating DNS on the AWS Network: Challenges and Lessons

Chris McFadden — Thu, 08 Jun 2017 16:36:25 +0000

A DNS Performance Incident

At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.

However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.

When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.

In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.

We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.

Extreme DNS Usage on AWS Network Hits a Limit

Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience an unexplained and severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.

I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.

In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.

Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.

Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.

Working with the AWS Team

Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible–our people make the difference.”

That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.

Here are some of the things we’ve learned about working together on this kind of problem solving:

Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides to navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations. If a support ticket or other issue is not getting the attention it deserves don’t hesitate to push harder.
Educate AWS on your unique use cases. Ensure that the AWS account team–especially your TAM and solution architect–is involved in as much of your daily workflow as possible. This way, they can learn your needs first hand and represent them inside of AWS. That’s a really important part of keeping the number of surprises to a minimum.
Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real-time–and to document the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.
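
As one illustration of the "systematic tests" point above, here is a minimal sketch of a repeatable DNS lookup test that produces data you could hand to a support team. It is not the test we ran; the function names are hypothetical, and it uses the standard library's `socket.gethostbyname`, which goes through the operating system's resolver, by default.

```python
import socket
import time


def time_lookups(names, resolve=socket.gethostbyname):
    """Time one lookup per name; `resolve` can be swapped for another resolver."""
    results = []
    for name in names:
        start = time.perf_counter()
        try:
            resolve(name)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        elapsed = time.perf_counter() - start
        results.append({"name": name, "ok": ok, "seconds": elapsed})
    return results


def summarize(results):
    """Reduce raw timings to a few numbers that are easy to compare across runs."""
    failures = sum(1 for r in results if not r["ok"])
    slowest = max((r["seconds"] for r in results), default=0.0)
    return {"lookups": len(results), "failures": failures, "slowest_seconds": slowest}
```

Running the same list of names on a schedule, from each network segment, gives both teams a shared baseline to compare against while an issue is being reproduced.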

The AWS network and platform is a key part of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.

Looking Ahead

In the coming weeks, we will write more in detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.

Whether you’re building for the AWS network yourself, or a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.

This post was originally published on SparkPost's blog.