<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: RJ Zaworski</title>
    <description>The latest articles on Forem by RJ Zaworski (@rjz).</description>
    <link>https://forem.com/rjz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F323477%2F77241e68-d048-4522-921d-bd0b57af6943.jpeg</url>
      <title>Forem: RJ Zaworski</title>
      <link>https://forem.com/rjz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rjz"/>
    <language>en</language>
    <item>
      <title>Unpacking the Technical Blog Post</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Thu, 10 Aug 2023 21:31:42 +0000</pubDate>
      <link>https://forem.com/rjz/unpacking-the-technical-blog-post-33ak</link>
      <guid>https://forem.com/rjz/unpacking-the-technical-blog-post-33ak</guid>
      <description>&lt;p&gt;Technical blog posts are a fantastic place to share knowledge and lessons learned. They're also a powerful tool for building visibility and (in the right setting) attracting customers. While they share DNA with other types of technical writing, the marketing opportunities surrounding technical posts make them a distinct form.&lt;/p&gt;

&lt;h2&gt;Have a goal&lt;/h2&gt;

&lt;p&gt;Before writing anything, get clear on the goal. Some of the motivation behind a technical post may be altruistic (sharing knowledge, advancing a theory, challenging assumptions, or otherwise joining in a broader community conversation), but even altruism can bring positive attention to the author. There's likely &lt;em&gt;some&lt;/em&gt; goal in the mix, and implicitly or otherwise it likely comes down to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Selling something. A technical post may be a good opportunity to introduce the audience to products, services, mailing lists, or even open job listings.&lt;/li&gt;
&lt;li&gt; Building visibility. Even a "sales-less" blog can raise the profile of the individual or organization that published it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These goals will likely stretch beyond a single post. A campaign to attract new sign-ups for a web application may be supported by a series of posts that all aim at the same segment of prospective users.&lt;/p&gt;

&lt;p&gt;Whatever the goal, it will inform the full cycle of the technical blog post---from who it's written for, to what it's written about, to how it's published and promoted.&lt;/p&gt;

&lt;p&gt;For now the thing to keep in mind is that the post &lt;em&gt;does&lt;/em&gt; have a goal, that the goal likely fits into a broader campaign, and that the goal is central to the post's construction.&lt;/p&gt;

&lt;h2&gt;Identify an audience&lt;/h2&gt;

&lt;p&gt;The audience for any given technical topic will be much smaller than the Internet-going population as a whole. But there are technical audiences, and then there are technical audiences. Depending on the post's goals, it may aim to reach beginners, experts, kindred spirits, ideological opposites, or some combination of all four.&lt;/p&gt;

&lt;p&gt;Different audiences come with conflicting goals. Writing aimed at beginners will alienate advanced readers, while technical minutiae presented without background or context will alienate almost everyone else. Two good questions for finding the middle ground are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Will the audience be generally familiar with foundational concepts? If the post references algorithms, standards, or practices in widespread use, it may be safe to leave them at a name and a link. If they're more obscure (at least for the audience at hand) they may benefit from more explanation in line.&lt;/li&gt;
&lt;li&gt; How much familiarity will the reader have with the specifics? A new technology (or less experienced audience) will require more explanation than a familiar one applied in a new way. The same goes for ideas and references that aren't yet part of industry canon.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Just as readers arrive with different levels of experience, they also come from different points of view. A post that challenges accepted dogma will need to offer more evidence than one built on existing beliefs.&lt;/p&gt;

&lt;p&gt;No post will resonate with every reader, and that's OK. There's no such thing as one-size-fits-all content, and no amount of writing and re-writing will change that. The game is to identify a target audience and make good use of their time.&lt;/p&gt;

&lt;h2&gt;Pick a topic that matters&lt;/h2&gt;

&lt;p&gt;With the goal established and the audience defined, the next step is to winnow down a long list of potential topics to the ones that really matter.&lt;/p&gt;

&lt;p&gt;This shouldn't be a spur-of-the-moment decision. Developers know that &lt;a href="https://rjzaworski.com/2016/06/the-best-code-youll-never-write"&gt;there's an art to not writing software&lt;/a&gt;, and the same is true of blogging. Ideas are cheap, and skipping over so-so ideas early means more time for developing the really good ones.&lt;/p&gt;

&lt;h3&gt;Keep a list of ideas&lt;/h3&gt;

&lt;p&gt;The first step in prioritizing what to write about is to list out all the possibilities. Do it in a spreadsheet, text file, or the nearest note-taking app, but do it in writing---it's a list that will grow over time.&lt;/p&gt;

&lt;p&gt;Once it's filled out, this list will likely include ideas for posts that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Demonstrate a technique&lt;/li&gt;
&lt;li&gt; Announce new projects, products, or features&lt;/li&gt;
&lt;li&gt; Answer burning questions&lt;/li&gt;
&lt;li&gt; Challenge assumptions&lt;/li&gt;
&lt;li&gt; Share lessons&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This list is also a good place to store new ideas that turn up while writing. Each post gets to make exactly one point. If new ones pop up and they're worth following up on, they'll make great posts of their own.&lt;/p&gt;

&lt;h3&gt;Prioritize the idea list&lt;/h3&gt;

&lt;p&gt;It's easier to generate ideas than to pare them back down. The real art is in ruling out possible posts to get down to the ones that will have a genuine impact.&lt;/p&gt;

&lt;p&gt;As a starting place, reflect on how each topic or idea might be received:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What will the reader learn?&lt;/li&gt;
&lt;li&gt; How will the reader's perception of [tool|technology|etc] change?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers aren't clear and compelling, move on. Time is precious, and a reader who isn't finding value won't stick around.&lt;/p&gt;

&lt;p&gt;A second filter to apply is novelty. It's fine to overlap with other writing on the Internet, but a post that doesn't offer a fresh perspective, new idea, or meaningful contribution beyond existing content is a post that doesn't need to be written.&lt;/p&gt;

&lt;p&gt;Finally, consider how receptive the audience will (or won't) be to a post's central theme. There's an audience for every position, and most technical content isn't overly bombastic or provocative. But when you're representing a broader organization or brand, the old adage that "any press is good press" is one to handle with care.&lt;/p&gt;

&lt;h2&gt;Write the post&lt;/h2&gt;

&lt;h3&gt;Find a voice&lt;/h3&gt;

&lt;p&gt;Technical writing prizes clarity over creativity and drama. There's still room to adjust tone to better connect with an audience, however, and particularly in the less-formal setting of a blog.&lt;/p&gt;

&lt;p&gt;For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  a conversational tone can help invite an unsure reader into a how-to article&lt;/li&gt;
&lt;li&gt;  a more assertive posture may create an air of authority for press releases or feature announcements&lt;/li&gt;
&lt;li&gt;  a challenging tone may help spark debate (and a more conciliatory tone may help disarm it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, assign the narrator's point of view. The easiest way to do this is through pronouns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Singular first-person pronouns (I, me) underscore accountability and may help an audience relate to a post-mortem or &lt;em&gt;mea culpa.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Plural first-person pronouns (we, us) replace a singular author with a collective---useful for representing an organization's general opinions or sharing its work.&lt;/li&gt;
&lt;li&gt;  Avoiding pronouns keeps the narrator out of things and may take some of the emotion out of the piece at large.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different voices may be appropriate in different situations. The key is to be consistent. If an existing style guide is available for the publication, blog, or organization in question, use that. If it isn't, a little time invested in a lightweight approximation will save time in the long run. Are pronouns acceptable? Passive voice? Adverbs?&lt;/p&gt;

&lt;p&gt;Answer those questions up front, and the details will follow.&lt;/p&gt;

&lt;h3&gt;Tell a story&lt;/h3&gt;

&lt;p&gt;Our brains come &lt;a href="https://hbr.org/2014/10/why-your-brain-loves-good-storytelling"&gt;wired to appreciate a good story&lt;/a&gt;. A little emotion and shared humanity &lt;a href="https://www.edutopia.org/article/neuroscience-narrative-and-memory"&gt;improve learning and retention&lt;/a&gt; over an impartial, procedural format, and increase the odds that a technical blog post will actually be read.&lt;/p&gt;

&lt;p&gt;Translating technical content into narrative form follows a fairly simple formula. Instead of straightforward exposition, the narrative version will build around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  a hero (who is it?)&lt;/li&gt;
&lt;li&gt;  a quest (what is the hero out to do?)&lt;/li&gt;
&lt;li&gt;  obstacles (what must the hero overcome?)&lt;/li&gt;
&lt;li&gt;  a climax (what is the final obstacle?)&lt;/li&gt;
&lt;li&gt;  a reward (has the hero grown? has the world changed?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the hero could be an operations team (in a post-mortem), a product manager (product release), or a software developer (a bug hunt), it may also be the reader herself. Tutorials and how-to articles often build around this structure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By reading this post (quest), you (hero) will encounter the challenges (obstacles) on the way to a solution (climax) that you can use to accomplish some task (reward).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whoever it features, the story should always be aligned to the post's target audience. "Middle-manager uses technology to slay recalcitrant corporate bureaucracy" may draw sympathy for a general audience, but it likely won't inspire readers drawn by a headline about a cutting-edge development framework.&lt;/p&gt;

&lt;h3&gt;Write the hook&lt;/h3&gt;

&lt;p&gt;A blog post's hook is simply a reason to keep reading. In one sentence (two at the most) it sets the stage and makes a promise about what's yet to come. It should answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What's the story? This may be explicit (as in a news &lt;a href="https://www.merriam-webster.com/dictionary/lede"&gt;lede&lt;/a&gt;), implicit (a destination revealed with little detail about the journey) or somewhere in between. A good hook will tease the story without giving everything away, piquing a reader's curiosity and drawing them deeper into the post.&lt;/li&gt;
&lt;li&gt; What's the benefit? Make the post's "value" concrete by telling the reader what (beyond a good story) they will gain by reading on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hook comes first. Write it last. Better yet, find some collaborators to toss ideas around and write several versions together. Settle on the ones that feel most compelling. After the article's headline and marketing metadata, the hook is the next most important step for drawing a reader in. It's worth the time it takes to get it right.&lt;/p&gt;

&lt;h3&gt;Content: words, sentences, and examples&lt;/h3&gt;

&lt;p&gt;"Content is king." What was true for Bill Gates in the mid-90s is no less true today, and no amount of marketing will make a lousy post worth reading. It starts with a clear goal (check), a great idea (yep), and a (hopefully) receptive audience.&lt;/p&gt;

&lt;p&gt;With technical content, the usual rules of good writing still apply. But the challenge of clear, economical writing is compounded by the challenge of creating clear, economical supporting examples that appeal to a wide range of audiences and individual learning styles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Diagrams and illustrations can help a reader orient themselves within a multi-step discussion or workflow. They're most useful when simple and clearly labeled: a diagram that attempts too much may be better off as a standalone chart or infographic.&lt;/li&gt;
&lt;li&gt; Code samples translate abstractions into concrete terms. Since they're easiest to understand when focused on a single idea, it's usually more helpful to construct a post's examples in isolation than to try explaining their place within more complicated (and likely unfamiliar) real-world systems.&lt;/li&gt;
&lt;li&gt; Data aggregations, analyses, and visualizations are much more helpful than raw datasets. Simpler tables, charts, and algorithms are easier to understand and more likely to be used.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All examples, regardless of format, should include brief captions and links to relevant external references (data sets, source code, and so on).&lt;/p&gt;

&lt;p&gt;Different posts will follow different conventions, but they should all treat the audience's time with the highest respect. The audience is filled with busy people who have other places to be. The central idea needs to be clear. Details need to clearly support it.&lt;/p&gt;

&lt;p&gt;Remember: a post gets exactly one call to action, and it belongs near the end. A reader will only see it if they have a reason to keep reading.&lt;/p&gt;

&lt;h3&gt;Wrap it all up&lt;/h3&gt;

&lt;p&gt;A post's conclusion is a chance to restate key points and revisit the hook's promise. Did the post deliver?&lt;/p&gt;

&lt;p&gt;Since it did, there is only one thing left to do. Make the ask. The post had a goal. Whether it was to sell something, build visibility, or just contribute to the broader body of knowledge, it's time to request something from the reader in return. That's the call to action (CTA), and there should only be one.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Hire me (us)&lt;/li&gt;
&lt;li&gt; Buy my (our) products or services&lt;/li&gt;
&lt;li&gt; Follow me (us) and/or share this post on social media&lt;/li&gt;
&lt;li&gt; Review our job listing&lt;/li&gt;
&lt;li&gt; ...and so on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make this clear, keep it brief, and put it in context of the rest of the post.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Like what you've read? Signing up for our newsletter is the best way to discover great new posts like this one.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Want to stay in the loop? Follow us on social media for ongoing coverage.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;These are the sorts of problems we solve every day. Did we mention that we're hiring?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this, the post has come full circle. After drawing an interested audience and giving them a unique insight, feature, or story, the CTA makes good on the goal that drove the post in the first place.&lt;/p&gt;

&lt;h2&gt;Publish and promote&lt;/h2&gt;

&lt;p&gt;Professional marketers spend their days figuring out how to reach potential customers with the right message at the right time. It's a fuzzy problem with a vast &lt;a href="https://en.wikipedia.org/wiki/Search_space"&gt;search space&lt;/a&gt; and very few clear answers, but a technical blog post can get pretty far by focusing in on three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The channels that draw readers to the post&lt;/li&gt;
&lt;li&gt; The conversion from the post's CTA&lt;/li&gt;
&lt;li&gt; The analytics that connect channels to conversions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words: How do people find the post? Is the post achieving its goal? And how do we know?&lt;/p&gt;

&lt;h3&gt;Marketing channels&lt;/h3&gt;

&lt;p&gt;Unless a post has the benefit of an established audience (via a newsletter, for instance) or a non-trivial marketing budget, readers will likely find it through one of three channels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; "Owned" media - via social media (Twitter, LinkedIn, etc) or forum (&lt;a href="http://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt;, reddit) posts from accounts managed by the post's author. This channel depends on interesting headlines and compelling post previews.&lt;/li&gt;
&lt;li&gt; "Earned" media - via shares or syndication from other social accounts&lt;/li&gt;
&lt;li&gt; Search engines - via longer-run Search Engine Optimization (SEO) campaigns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Channel strategies may overlap, but don't count on it. Search engine rankings take time and careful keyword planning; social posts may disregard keywords entirely in favor of enticing, clickable headlines. "Evergreen" content will rarely be able to ride a wave of current events, but more topical content may quickly lose relevance.&lt;/p&gt;

&lt;p&gt;The key is to pick a single channel and put the time in to develop it. While every channel has its strengths and weaknesses, steady ongoing efforts to attract followers (owned media); build relationships (earned media); and build credibility (search engines) will help them reach their full potential. Whatever the focus, one reliable hit will always be better than three or four near misses.&lt;/p&gt;

&lt;p&gt;Finally, though additional channels (e.g. emails/newsletters, paid advertising, syndication, etc) may help boost reach in a more mature marketing campaign, they will add to---not replace---the big three.&lt;/p&gt;

&lt;h3&gt;Conversions&lt;/h3&gt;

&lt;p&gt;For the purpose of the post's goal, a conversion is a user answering the CTA.&lt;/p&gt;

&lt;p&gt;For the purpose of tuning the post towards that goal, it's helpful to think of the post as its own little funnel. Multiple "conversions" lead up to the CTA, as a reader:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  finds the post&lt;/li&gt;
&lt;li&gt;  finishes reading it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A post that readers aren't finding, or a channel that's underperforming, implies low visibility, a lack of interest, or both. It may be possible to tweak the post's preview (title, description, and opengraph/structured data) to improve discoverability. It may also just be a miss on audience, topic, or channel. These things happen. Time to take a different swing.&lt;/p&gt;

&lt;p&gt;If readers aren't reaching the end of the post, it may be down to a perceived lack of value---or to the hook's promise being fulfilled well before the end. This is an easier problem than getting people in the door in the first place, but no less crippling to the blog post's goals.&lt;/p&gt;

&lt;h3&gt;Instrumentation and analytics&lt;/h3&gt;

&lt;p&gt;Google Analytics' defaults will go pretty far, especially if inbound links include sensible &lt;a href="https://support.google.com/analytics/answer/1033863?hl=en"&gt;&lt;code&gt;utm&lt;/code&gt; parameters&lt;/a&gt;. The channel breakout will show where visitors are arriving from; with an instrumented call to action (if it's a form, button, or anything other than a link) it's a small step to track conversions the whole way through the campaign.&lt;/p&gt;
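&lt;p&gt;For instance, a link promoted from a social account might be tagged like this (the URL and parameter values here are purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://example.com/blog/my-post?utm_source=twitter&amp;amp;utm_medium=social&amp;amp;utm_campaign=spring-launch
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Visits (and any downstream conversions) arriving through that link will then be attributed to the tagged source, medium, and campaign.&lt;/p&gt;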

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;That's it: from ideation to promotion, a comprehensive look at crafting interesting, insightful technical blog posts. And don't forget to share a link as soon as that post's finished!&lt;/p&gt;

</description>
      <category>writing</category>
      <category>blog</category>
      <category>learning</category>
    </item>
    <item>
      <title>Managing Application State with Algebraic Effects</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Fri, 31 Dec 2021 19:41:31 +0000</pubDate>
      <link>https://forem.com/rjz/managing-application-state-with-algebraic-effects-13a7</link>
      <guid>https://forem.com/rjz/managing-application-state-with-algebraic-effects-13a7</guid>
      <description>&lt;p&gt;&lt;em&gt;"And how's that working out for you?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You get used to the awe, horror, skepticism, and (occasionally) curiosity whenever someone learns that your &lt;a href="https://koan.co"&gt;startup&lt;/a&gt;'s built on top of &lt;a href="https://aws.amazon.com/dynamodb/"&gt;DynamoDB&lt;/a&gt;. I get it. Starting with Dynamo means you're either supremely confident in your business model, domain model, and the data-access patterns that result--or utterly unconcerned about paying-as-you-go for the batteries that weren't included. Schema design is agony. We sank far too much development time into &lt;code&gt;JOIN&lt;/code&gt;-ing data and reconstructing other basic functionality. But as we learned to live with it we also got a very flat performance curve and a decent handle on what works and what doesn't.&lt;/p&gt;

&lt;h2&gt;Live and learn&lt;/h2&gt;

&lt;p&gt;If we went back and did it all again, two decisions made later in the game would have enormously improved our quality of life at the start.&lt;/p&gt;

&lt;p&gt;Decision one: ditching our ORM. DynamoDB is not a relational database, and any library that tries to nudge in that direction is a fast road to the wrong access patterns and a lousy developer experience. Life got much better when &lt;a href="https://twitter.com/swac"&gt;Ashwin Bhat&lt;/a&gt; started tearing up the edges of that inappropriate abstraction in favor of the &lt;a href="https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/DocumentClient.html"&gt;DynamoDB &lt;code&gt;DocumentClient&lt;/code&gt;&lt;/a&gt; and a successful &lt;a href="https://medium.com/developing-koan/modeling-graph-relationships-in-dynamodb-c06141612a70"&gt;single-table design&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The other key change was to borrow heavily from &lt;a href="https://www.eff-lang.org/handlers-tutorial.pdf"&gt;effects-based programming&lt;/a&gt; for non-trivial state updates. At risk of ruining the punchline, most web apps are variations on the same four-part theme: load data, prepare changes, apply changes, repeat. Expressing the middle two stages in terms of "effects", even in a crude, homebuilt form, had a profound impact on our application as a whole, unlocking performance, improving visibility, and decreasing cycle times for new development.&lt;/p&gt;

&lt;h2&gt;An example&lt;/h2&gt;

&lt;p&gt;Before we get to the good stuff, consider a web service responsible for managing a team's roster. It probably includes a way to add a new user to a team, perhaps with a method like this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;addUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;members&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getMembers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;members&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addMember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method is neat and concise, but it comes with at least a few, &lt;em&gt;ahem&lt;/em&gt;, concerns.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The business logic around adding users who are already part of the team is poorly defined. Right now we simply skip the operation while other changes take place in parallel--we don't make any attempt to roll back, or notify the caller about a partial failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whatever the business logic ought to be, interspersing data access throughout the implementation will make it harder to test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Likewise, any exception-handling logic must either exist at the data layer (limiting coordination across parallel requests) or be implemented as a one-off inside this method.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of these are solvable problems, however, and teasing out the &lt;code&gt;load-prepare-apply&lt;/code&gt; pipeline implicit in the naïve implementation is a good place to start.&lt;/p&gt;

&lt;h2&gt;Changes as effects&lt;/h2&gt;

&lt;p&gt;As a first step, let's separate computing the changes to make (the business logic) from how they're sent to the database. We'll then glue them together using a list of effects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;MembershipEffect&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;MembershipEffect&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADD_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;MembershipEffect&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DEL_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// etc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the business logic. Instead of writing directly to the datastore as we were before, we'll now produce an &lt;code&gt;'ADD_MEMBER'&lt;/code&gt; effect representing a user who needs to be added.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;prepareAddUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;members&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getMembers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;members&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADD_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point we've addressed one of the shortcomings of the original implementation (separating out a failable DB write) while also gaining the ability to peek into the "performed" effects before some or all of them have been written. More on that in a moment.&lt;/p&gt;
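&lt;p&gt;To make that "peek" concrete, here's a minimal, self-contained sketch of a dry-run inspection over prepared effects. The &lt;code&gt;Effect&lt;/code&gt; shape and the &lt;code&gt;summarizeEffects&lt;/code&gt; helper are illustrative stand-ins, not part of any real API:&lt;/p&gt;

```typescript
// Illustrative stand-in for the Effect type used in this article
type Effect = { type: 'ADD_MEMBER'; team: string; user: string };

// Summarize prepared effects without writing anything: handy for logging,
// validation, or a dry-run mode before any DB write happens.
function summarizeEffects(effects: Effect[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const effect of effects) {
    counts[effect.type] = (counts[effect.type] ?? 0) + 1;
  }
  return counts;
}

const pending: Effect[] = [
  { type: 'ADD_MEMBER', team: 't1', user: 'u1' },
  { type: 'ADD_MEMBER', team: 't2', user: 'u1' },
];

console.log(summarizeEffects(pending)); // { ADD_MEMBER: 2 }
```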

&lt;h2&gt;
  
  
  Effect processing
&lt;/h2&gt;

&lt;p&gt;Now that we're producing effects, we need a way to apply them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;applyMemberships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADD_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addMember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nl"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Not implemented: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;applyMemberships&lt;/code&gt; is a minimal effect processor, nothing more. It doesn't care where the effects came from. It only cares that some upstream logic coughed them up, and--now that they're here--that they get applied to the application state. And since it's a small, standalone function, it's easily extended. That could mean providing a general-purpose rollback strategy for when effect processing fails, or providing alternative "commit" strategies to ensure data is persisted via an appropriate API.&lt;/p&gt;
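&lt;p&gt;As a sketch of what a general-purpose rollback strategy could look like, the wrapper below applies effects one at a time and compensates in reverse order if any application fails. The &lt;code&gt;apply&lt;/code&gt; and &lt;code&gt;undo&lt;/code&gt; handlers are hypothetical, injected per-effect operations:&lt;/p&gt;

```typescript
// Illustrative stand-in for this article's Effect type
type Effect = { type: 'ADD_MEMBER'; team: string; user: string };

// Apply effects sequentially; on failure, undo whatever was applied,
// in reverse order, then rethrow the original error.
async function applyWithRollback(
  effects: Effect[],
  apply: (e: Effect) => Promise<void>,
  undo: (e: Effect) => Promise<void>,
): Promise<void> {
  const applied: Effect[] = [];
  try {
    for (const effect of effects) {
      await apply(effect);
      applied.push(effect);
    }
  } catch (err) {
    for (const effect of applied.reverse()) {
      await undo(effect); // best-effort compensation
    }
    throw err;
  }
}
```

Sequential application trades some throughput for a well-defined rollback order; a real implementation would also need a policy for failures during `undo` itself.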

&lt;p&gt;With DynamoDB, for instance, a single call to &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html"&gt;&lt;code&gt;TransactWriteItems&lt;/code&gt;&lt;/a&gt; (if we needed transactional checks or guarantees) or &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html"&gt;&lt;code&gt;BatchWriteItem&lt;/code&gt;&lt;/a&gt; (the rest of the time) will be both safer and cheaper than adding team members in separate &lt;code&gt;PutItem&lt;/code&gt; requests. Making the switch is a small change to &lt;code&gt;applyMemberships&lt;/code&gt;: just batch up the memberships and write them simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;applyMemberships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;putItems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADD_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asPutMembershipItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effect&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// in real life we would chunk large-n batches&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;docClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;batchWriteItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;putItems&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Crucially, we can make this adjustment &lt;em&gt;without changing business logic&lt;/em&gt;! The same rules still apply. &lt;code&gt;applyMemberships&lt;/code&gt; only cares about how effects are interpreted. If we need to change how data are written, we just do it. If we need to trigger a welcome email or notify a billing service, we could add an additional effect--or, if it's inextricably linked to the &lt;code&gt;ADD_MEMBER&lt;/code&gt; effect, just tweak the processor to make sure it happens.&lt;/p&gt;
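&lt;p&gt;For instance, a hypothetical &lt;code&gt;Mailer&lt;/code&gt; hook could ride along with the &lt;code&gt;ADD_MEMBER&lt;/code&gt; branch of the processor. The services below are simplified, in-memory stand-ins for whatever persistence and email APIs actually exist:&lt;/p&gt;

```typescript
// Simplified stand-ins for this article's types and services
type Effect = { type: 'ADD_MEMBER'; team: string; user: string };

const memberships: string[] = []; // stand-in for the membership store
const outbox: string[] = []; // stand-in for a real mailer queue

const Teams = {
  async addMember(team: string, user: string) {
    memberships.push(`${team}:${user}`);
  },
};

const Mailer = {
  async sendWelcome(user: string, team: string) {
    outbox.push(`welcome ${user} to ${team}`);
  },
};

async function applyMemberships(effects: Effect[]) {
  await Promise.all(
    effects.map(async (effect) => {
      switch (effect.type) {
        case 'ADD_MEMBER':
          // persist first, then fire the inextricably-linked side effect
          await Teams.addMember(effect.team, effect.user);
          await Mailer.sendWelcome(effect.user, effect.team);
          return;
        default:
          throw new Error(`Not implemented: "${(effect as { type: string }).type}"`);
      }
    }),
  );
}
```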

&lt;h2&gt;
  
  
  Logic and context
&lt;/h2&gt;

&lt;p&gt;We can refactor &lt;code&gt;prepareAddUserToTeams&lt;/code&gt; just as freely. For instance, we might finish decoupling business logic and data access by preloading memberships (or any other relevant context).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;membersByTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="na"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;loadMembershipContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;teamMembers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Teams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;batchGetMembers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;members&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;teamMembers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;members&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;membersByTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fromEntries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With all of the I/O handled externally, the logic around adding new members condenses into an almost-trivial (and blissfully &lt;a href="https://en.wikipedia.org/wiki/Pure_function"&gt;pure&lt;/a&gt;!) function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;prepareAddUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;membersByTeamId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;members&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;members&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADD_MEMBER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;team&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pipes all the way down
&lt;/h2&gt;

&lt;p&gt;The somewhat naive implementation we started with has gotten considerably more verbose. In return, decoupling the &lt;code&gt;load-prepare-apply&lt;/code&gt; stages has yielded reusable solutions to two-thirds of &lt;em&gt;any&lt;/em&gt; change related to the team's membership roster. Implement the business logic and you're off!&lt;/p&gt;

&lt;p&gt;Using the &lt;code&gt;load&lt;/code&gt; and &lt;code&gt;apply&lt;/code&gt; building blocks, we can now collapse &lt;code&gt;addUserToTeams&lt;/code&gt; down into something much more concise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;addUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamId&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loadMembershipContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prepareAddUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;applyMemberships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're familiar with Node's &lt;a href="https://nodejs.org/api/stream.html"&gt;Stream API&lt;/a&gt; (or pretty much any functional programming language) you'll recognize a pipeline in the making. Here's how it would look if the &lt;a href="https://docs.hhvm.com/hack/expressions-and-operators/pipe"&gt;Hack-style&lt;/a&gt; pipelines currently favored in the (still-very-unsettled) &lt;a href="https://github.com/tc39/proposal-pipeline-operator"&gt;TC39 Pipeline proposal&lt;/a&gt; were adopted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;addUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loadMembershipContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;prepareAddUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;applyMemberships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Besides increasing testability, reusability, and confidence in each step, we can now add additional business logic independent of the data layer. For example, the same building blocks easily recombine into a new API method for populating an entire team's roster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;populateTeam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TeamIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loadMembershipContext&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;effects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;userIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;prepareAddUserToTeams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;applyMemberships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;effects&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After updating &lt;code&gt;applyMemberships&lt;/code&gt; to handle &lt;code&gt;DEL_MEMBER&lt;/code&gt; and &lt;code&gt;SET_ACCESS&lt;/code&gt; effects (with some attention required around de-duplication and ordering), we could go on to implement an entire roster-management application with only minimal regard for where and how it's persisted.&lt;/p&gt;
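&lt;p&gt;A sketch of that extended effect vocabulary, plus one possible de-duplication pass (last write wins per team/user pair), might look like this. Only the effect names come from the discussion above; the &lt;code&gt;role&lt;/code&gt; payload and the &lt;code&gt;dedupe&lt;/code&gt; helper are assumptions:&lt;/p&gt;

```typescript
type TeamId = string;
type UserId = string;

// Extended effect union; the DEL_MEMBER and SET_ACCESS names follow the
// article, but the exact payloads are assumed for illustration.
type Effect =
  | { type: 'ADD_MEMBER'; team: TeamId; user: UserId }
  | { type: 'DEL_MEMBER'; team: TeamId; user: UserId }
  | { type: 'SET_ACCESS'; team: TeamId; user: UserId; role: string };

// Collapse redundant effects: for a given (team, user) pair, a later
// membership effect supersedes an earlier one ("last write wins").
function dedupe(effects: Effect[]): Effect[] {
  const byKey = new Map<string, Effect>();
  for (const effect of effects) {
    const kind = effect.type === 'SET_ACCESS' ? 'access' : 'member';
    byKey.set(`${kind}:${effect.team}:${effect.user}`, effect);
  }
  return [...byKey.values()];
}
```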

&lt;h2&gt;
  
  
  Back in the real world
&lt;/h2&gt;

&lt;p&gt;An effect-based approach adds undeniable indirection and (local) complexity. For a method as simple as &lt;code&gt;addUserToTeams&lt;/code&gt;, our naive first implementation may be the right way to go. As users, side effects, or the surface area of our membership API increase, however, effects provide a fairly straightforward way to manage them. They might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;provide preflight mechanisms for data migrations (by inspecting a migration's effects &lt;em&gt;before&lt;/em&gt; applying it)&lt;/li&gt;
&lt;li&gt;simplify development of batch operations (e.g. archiving or deleting many related nodes in a graph) by encouraging proven, reusable load/apply methods&lt;/li&gt;
&lt;li&gt;simplify complex business processes (e.g. computing and synchronizing billing details owned by a mix of 1st- and 3rd-party systems) by isolating effects from their application.&lt;/li&gt;
&lt;/ol&gt;
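&lt;p&gt;The preflight idea in particular falls out almost for free from the load-prepare-apply split: prepare a migration's effects, inspect them, and only apply when they pass review. Everything below is a hypothetical sketch; &lt;code&gt;prepare&lt;/code&gt;, &lt;code&gt;approve&lt;/code&gt;, and &lt;code&gt;apply&lt;/code&gt; are injected stand-ins:&lt;/p&gt;

```typescript
type Effect = { type: string };

// Prepare effects, submit them to an approval predicate, and apply only
// when approved. Nothing is written unless `approve` says yes.
async function preflight<E extends Effect>(
  prepare: () => Promise<E[]>,
  approve: (effects: E[]) => boolean,
  apply: (effects: E[]) => Promise<void>,
): Promise<{ applied: boolean; effects: E[] }> {
  const effects = await prepare();
  if (!approve(effects)) {
    return { applied: false, effects };
  }
  await apply(effects);
  return { applied: true, effects };
}
```

A migration runner might wire `approve` up to print a summary of the pending effects and prompt an operator before committing anything.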

&lt;p&gt;Ultimately, using effects as the basis of the &lt;code&gt;load-prepare-apply&lt;/code&gt; pipeline at the heart of most state changes isn't the big step it seems. Yes, it takes time to verify mostly-independent parts and reconstitute them into a working whole. But once built, and once trusted, those parts tremendously accelerate future development.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://unsplash.com/@iurte"&gt;Iker Urteaga&lt;/a&gt; via &lt;a href="https://unsplash.com/photos/TL5Vy1IM-uA"&gt;unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Anatomy of a high-velocity CI/CD pipeline</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Thu, 02 Dec 2021 16:35:17 +0000</pubDate>
      <link>https://forem.com/koan/anatomy-of-a-high-velocity-cicd-pipeline-251o</link>
      <guid>https://forem.com/koan/anatomy-of-a-high-velocity-cicd-pipeline-251o</guid>
      <description>&lt;p&gt;If you’re going to optimize your development process for one thing, make it speed. Not the kind of speed that racks up technical debt on the team credit card or burns everyone out with breathless sprints, though. No, the kind of speed that treats time as your most precious resource, which it is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Speed is the startup’s greatest advantage&lt;/em&gt;. Speed means not wasting time. Incorporating new information as soon as it’s available. Getting products to market. Learning from customers. And responding quickly when problems occur. But speed with no safeguards is simply recklessness. Moving fast requires systems for ensuring we’re still on the rails.&lt;/p&gt;

&lt;p&gt;We’ve woven many such systems into the sociotechnical fabric of our startup, but maybe the most crucial among them are the continuous integration and continuous delivery processes that keep our work moving swiftly towards production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business case for CI/CD
&lt;/h2&gt;

&lt;p&gt;Writing in 2021 it’s hard to imagine building web applications without the benefits of continuous integration, continuous delivery, or both. Running an effective CI/CD pipeline won’t score points in a sales pitch or (most) investor decks, but it can make significant strategic contributions to both business outcomes and developer quality of life. The virtuous cycle goes something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  faster feedback&lt;/li&gt;
&lt;li&gt;  fewer bugs&lt;/li&gt;
&lt;li&gt;  increased confidence&lt;/li&gt;
&lt;li&gt;  faster releases&lt;/li&gt;
&lt;li&gt;  more feedback (even faster this time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even on teams (like ours!) that haven’t embraced the dogma (or overhead) of capital-A-Agile processes, having the confidence to release early and often still unlocks shorter development cycles and reduces time to market.&lt;/p&gt;

&lt;p&gt;As a developer, you’re probably already bought into this idea. If you’re feeling resistance, though, here’s a quick summary for the boss:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sVyKVIDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AI0ZsHIQl7d5L0NmxozdnrA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sVyKVIDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AI0ZsHIQl7d5L0NmxozdnrA.png" alt="Graphic illustrating the business case for continuous integration and delivery: feedback, quality, confidence, velocity." width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The business case for continuous integration and delivery&lt;/p&gt;

&lt;h2&gt;
  
  
  Is CI/CD worth the effort?
&lt;/h2&gt;

&lt;p&gt;Nobody likes a red build status indicator, but the truth is that builds fail. That’s why status dashboards exist, and a dashboard glowing crimson in the light of failing builds is much, much better than no dashboard at all.&lt;/p&gt;

&lt;p&gt;Still, that dashboard (nevermind the systems and subsystems it’s reporting on) is pure overhead. Not only are you on the hook to maintain code and release a dozen new features by the end of the week, but also to maintain the litany of scripts, tests, configuration files, and dashboards needed to build, verify, and deploy it all. When the server farm of Mac Minis in the basement hangs, you’re on the hook to restart it. That’s less time available to actually build the app.&lt;/p&gt;

&lt;p&gt;This is a false dilemma, though. You can solve this problem by throwing resources at it. Managed services eliminate much of the maintenance burden, and when you’ve reached the scale where one-size-fits-all managed services break down you can likely afford to pay a full-time employee to manage Jenkins.&lt;/p&gt;

&lt;p&gt;So, there are excuses for not having a reliable CI/CD pipeline. They just aren’t very good ones. The payoff — in confidence, quality, velocity, learning, or &lt;em&gt;whatever&lt;/em&gt; you hope to get out of shipping more software — is well worth any pain the pipeline incurs.&lt;/p&gt;

&lt;p&gt;Yes, even if it has to pass through Xcode.&lt;/p&gt;

&lt;h2&gt;
  
  
  A guiding principle
&lt;/h2&gt;

&lt;p&gt;Rather than prescribing the ultimate CI/CD pipeline in an edict from on-high, we’ve taken guidance from one of our team principles and evolved our practices and automation from there. It reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ship to Learn&lt;/strong&gt;. We release the moment that staging is better than prod, listen early and often, and move faster because of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Continuous integration is a big part of the story, of course, but the same guidance applies back to the pipeline itself.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Releasing the moment that staging is better than prod&lt;/strong&gt; is easy to do. This is nearly always the case, and keeping up with it means having both a lightweight release process and confidence in our work. Individual investment and a reasonably robust test suite are all well and good; better is having a CI/CD pipeline that makes them the norm (if not the rule).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Listening early and often&lt;/strong&gt; is all about gathering feedback as quickly as we possibly can. The sooner we understand whether something is working or not, the faster we can know whether to double down or adapt. Feedback in seconds is better than in minutes (and certainly better than hours).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Moving faster&lt;/strong&gt; includes product velocity, of course, but also the CI/CD process itself. Over time we’ve automated what we reasonably can; still, several exception-heavy stages remain in human hands, and we don’t expect that to change soon. Here, “moving fast” means enabling manual review and acceptance testing rather than replacing them.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  So, our pipeline
&lt;/h2&gt;

&lt;p&gt;Product velocity depends on the pipeline that enables it. With that in mind, we’ve constructed our pipeline to address the hypothesis that &lt;em&gt;issues uncovered at any stage are exponentially more expensive to fix than those solved at prior stages&lt;/em&gt;. Issues &lt;em&gt;will&lt;/em&gt; happen, but checks that uncover them early on drastically reduce friction at the later, more extensive stages of the pipeline.&lt;/p&gt;

&lt;p&gt;Here’s the boss-friendly version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N5w-tW7_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AsxIjoTACEZbnafUfLvwgKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N5w-tW7_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AsxIjoTACEZbnafUfLvwgKA.png" alt="A CI/CD pipeline and the time required to test at each stage" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test early, test often&lt;/p&gt;

&lt;h2&gt;
  
  
  Local development
&lt;/h2&gt;

&lt;p&gt;Continuous integration starts immediately. If you disagree, consider the feedback time needed to integrate and test locally versus anywhere else. It’s seconds (rebasing against our &lt;code&gt;main&lt;/code&gt; branch or acting on feedback from a pair-programming partner) or minutes (a full run of our test suite) at most.&lt;/p&gt;

&lt;p&gt;We’ve made much of it automatic. Our editors are configured to take care of &lt;a href="https://prettier.io/"&gt;styles and formatting&lt;/a&gt;; &lt;a href="https://dev.to/developing-koan/porting-koans-150-000-line-javascript-codebase-to-typescript-b4818ccc42ac"&gt;TypeScript provides a first layer of testing&lt;/a&gt;; and &lt;a href="https://rjzaworski.com/2018/01/keeping-git-hooks-in-sync"&gt;shared git hooks&lt;/a&gt; run project-specific static checks.&lt;/p&gt;

&lt;p&gt;One check we don’t enforce is running our full test suite. Run time goes up linearly with the size of a test suite, and — while we’re culturally averse to writing tests for their own sake — running our entire suite on every commit would be prohibitively expensive. &lt;em&gt;What needs testing&lt;/em&gt; is up to individual developers’ discretion, and we avoid adding redundant or pointless tests to the test suite just as we avoid redundant test &lt;em&gt;runs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Make it fast, remember? That applies to local checks, too. Fast checks get run. Slow checks? No-one has time for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated CI
&lt;/h2&gt;

&lt;p&gt;Changes pushed from local development to our central repository trigger the next layer of checks in the CI pipeline. Feedback here is slower than in local development but still fairly fast, requiring about 10 minutes to run all tests and produce a viable build.&lt;/p&gt;

&lt;p&gt;Here’s what it looks like in Github:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RCKoTqSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Avc7YZRoU-4OxVaEjHrhDZg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RCKoTqSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Avc7YZRoU-4OxVaEjHrhDZg.png" alt="Screenshot of automated tests passing in Github’s UI" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Green checks are good checks.&lt;/p&gt;

&lt;p&gt;There are several things going on here: repeats of the linting and static analysis run locally, a run through our complete &lt;code&gt;backend&lt;/code&gt; test suite, and deployment of artifacts used in manual QA. The other checks are variations on this theme—different scripts poking and prodding the commit from different angles to ensure it's ready for merging into &lt;code&gt;main&lt;/code&gt;. Depending on the nature of the change, we may require up to a dozen checks to pass before the commit is greenlit for merge.&lt;/p&gt;
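&lt;p&gt;The merge gate itself reduces to a simple rule. As a minimal sketch (the check names and result shape are hypothetical, not our pipeline’s actual API), a commit is ready only when every required check has passed:&lt;/p&gt;

```typescript
type CheckResult = { name: string; passed: boolean };

// A commit is greenlit for merge only when every required check has
// reported success. Check names here are illustrative.
function readyToMerge(required: string[], results: CheckResult[]): boolean {
  const passed = new Set(results.filter((r) => r.passed).map((r) => r.name));
  return required.every((name) => passed.has(name));
}
```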

&lt;h2&gt;
  
  
  Peer review
&lt;/h2&gt;

&lt;p&gt;In tandem with the automated CI checks, we require manual review and sign-off before changes can be merged into &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;“Manual!?” I hear the purists cry, and yes — the “M” word runs counter to the platonic ideal of totally automated CI. Hear me out. The truth is that every step in our CI/CD pipeline existed as a manual process first. Automating something before truly understanding it is a sure path to inappropriate abstractions, maintenance burden, and at least a few choice words from future generations. And it doesn’t always make sense. For processes that are and always will be dominated by exceptions (design review and acceptance testing, to pick two common examples) we’ve traded any aspirations at full automation for tooling that &lt;em&gt;enables&lt;/em&gt; manual review. We don’t expect to change this any time soon.&lt;/p&gt;

&lt;p&gt;Manual review for us consists of (required) code review and (optional) design review. Code review covers a &lt;a href="https://github.com/rjz/code-review-checklist"&gt;checklist&lt;/a&gt; of logical, quality, and security concerns, and we (plus Github &lt;a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/about-protected-branches"&gt;branch protection&lt;/a&gt;) require &lt;em&gt;at least&lt;/em&gt; two team members to believe a change is a good idea before we ship it. Besides collective ownership, it’s also a chance to apply a modicum of QA and build shared understanding around what’s changing in the codebase. Ideally, functional issues that weren’t caught locally get caught here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design review
&lt;/h2&gt;

&lt;p&gt;Design review is typically run in tandem with our counterparts in product and design, and aims to ensure that designs are implemented to spec. We provide two channels for reviewing changes before a pull request is merged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; preview builds of &lt;a href="https://dev.to/developing-koan/routing-on-the-edge-913eb00da742"&gt;a “live” application&lt;/a&gt; that reviewers can interact with directly&lt;/li&gt;
&lt;li&gt; &lt;a href="https://storybook.js.org/"&gt;storybook&lt;/a&gt; builds that showcase specific UI elements included within the change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both the preview and storybook builds are linked from Github’s pull request UI as soon as they’re available. They also nicely illustrate the types of tradeoffs we’ve frequently made between complexity (neither build is trivial to set up and maintain), automation (know what would be trickier? Automatic visual regression testing, that’s what), and manual enablement (the time we &lt;em&gt;have&lt;/em&gt; decided to invest has proven well worth it).&lt;/p&gt;

&lt;p&gt;The bottom line is that — just like with code review — we would prefer to catch design issues while pairing up with the designer during initial development. But if something slipped through, design review lets us respond more quickly than at stages further down the line.&lt;/p&gt;

&lt;p&gt;The feedback from manual review steps is still available quickly, though: generally within an hour or two of a new pull request being opened. And then it’s on to our staging environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous delivery to staging
&lt;/h2&gt;

&lt;p&gt;Merging a pull request into our &lt;code&gt;main&lt;/code&gt; branch finally flips the coin from continuous integration to continuous delivery. There's one more CI pass first, however: since we identify builds by the commit hash they're built from, a merge commit in &lt;code&gt;main&lt;/code&gt; triggers a new CI run that produces the build artifact we deliver to our staging environment.&lt;/p&gt;
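&lt;p&gt;Tying artifacts to the commit hash keeps builds reproducible and easy to trace back to source. As a minimal sketch (the tag format is invented for illustration, not our pipeline’s actual naming scheme):&lt;/p&gt;

```typescript
// Derive an artifact identifier from the commit hash it was built from,
// so every CI run of the same commit names the same artifact.
// The "app-<target>-<shortSha>" format is illustrative only.
function artifactTag(commitSha: string, target: 'staging' | 'production'): string {
  const shortSha = commitSha.slice(0, 7);
  return `app-${target}-${shortSha}`;
}
```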

&lt;p&gt;The process for vetting a staging build is less prescriptive than for the stages that precede it. Most of the decision around how much QA or acceptance testing to run in staging rests with the on-call developer, who doubles as &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;our de-facto release manager&lt;/a&gt; and reviews the list of changes, calling for validation as needed. A release consisting of well-tested refactoring may get very little attention. A major feature may involve multiple QA runs and pull in stakeholders from our product, customer success, and marketing teams. Most releases sit somewhere in the middle.&lt;/p&gt;

&lt;p&gt;Every staging release receives at least passing notice, for the simple reason that we use Koan ourselves — and specifically, an instance hosted in the staging environment. We eat our own dogfood, and it’s a flavor that’s always slightly ahead of the one our customers are using in production.&lt;/p&gt;

&lt;p&gt;Staging feedback isn’t without hiccups. At any time we’re likely to have 3–10 feature flags gating various in-development features, and the gap between staging and production configurations can lead to team members reporting false positives on features that aren’t yet ready for release. In response, we’ve invested in internal tooling that allows team members to adopt a specific production configuration in their local or staging environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H-6vJWX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Ali06crRg2vyrgmRUZMCAaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H-6vJWX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Ali06crRg2vyrgmRUZMCAaw.png" alt="Internal UI for forcing production configuration in staging environment — the design team loves this one." width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The aesthetics are edgy (controversial, even), but the value is undeniable. We’re able to freely build and test features prior to production release, and then easily verify whether a pre-release bug will actually manifest in the production version of the app.&lt;/p&gt;
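&lt;p&gt;To sketch the idea (the flag names and override mechanism here are hypothetical, not our actual tooling), resolving a feature flag with an environment override might look like:&lt;/p&gt;

```typescript
type Environment = 'staging' | 'production';
type FlagDefaults = Record<Environment, Record<string, boolean>>;

// Resolve a feature flag from per-environment defaults, with an optional
// override so staging can mimic the production configuration.
// Flag names and the override mechanism are hypothetical.
function flagEnabled(
  flag: string,
  env: Environment,
  defaults: FlagDefaults,
  overrideEnv?: Environment,
): boolean {
  const effective = overrideEnv ?? env;
  return defaults[effective][flag] ?? false;
}
```

&lt;p&gt;With an override in place, a staging instance answers flag checks exactly as production would, which is what makes pre-release bugs so much easier to classify.&lt;/p&gt;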

&lt;p&gt;If you’re sensing that issues caught in staging are more expensive to diagnose and fix than those caught earlier on, you’d be right. Feedback here is much slower than at earlier stages, with detection and resolution taking up to several hours. But issues caught in staging are still much easier to address before they’re released to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual release to production
&lt;/h2&gt;

&lt;p&gt;The “I” in CI is unambiguous. Different teams may take “integration” to mean different things — note the inclusion of critical-if-not-exactly-continuous manual reviews in our own integration process — but “I” always means “integration.”&lt;/p&gt;

&lt;p&gt;The “D” is less straightforward, standing in (depending on who you’re talking to, the phase of the moon, and the day of the week) for either “Delivery” or “Deployment,” and &lt;a href="https://www.atlassian.com/continuous-delivery/principles/continuous-integration-vs-delivery-vs-deployment"&gt;they’re not quite the same thing&lt;/a&gt;. We’ve gained enormous value from Continuous Delivery. We haven’t made the leap (or investment) to deploy directly to production.&lt;/p&gt;

&lt;p&gt;That’s a conscious decision. Manual QA and acceptance testing have proven tremendously helpful in getting the product right. Keeping a human in the loop ahead of production helps ensure that we connect with relevant stakeholders (in product, growth, and even key external accounts) prior to our otherwise-frequent releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing in production
&lt;/h2&gt;

&lt;p&gt;As the joke goes, we test comprehensively: all issues missed by our test suite will be caught in production. There aren’t many of these, fortunately, but a broad enough definition of testing ought to encompass the instrumentation, monitoring, alerting, and customer feedback that help us identify defects in our production environment.&lt;/p&gt;

&lt;p&gt;We’ve previously shared an outline of our &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;cherished (seriously!) on-call rotation&lt;/a&gt;, and the instrumentation beneath it is a discussion for another day, but suffice to say that an issue caught in production takes much longer to fix than one caught locally. Add in the context-switching required from team members who have already moved on to other things, and it’s no wonder we’ve invested in catching issues earlier on!&lt;/p&gt;

&lt;h2&gt;
  
  
  Revising the pipeline
&lt;/h2&gt;

&lt;p&gt;Increasing velocity means adding people, reducing friction, or (better yet) both. &lt;a href="https://dev.to/@rjzaworski/sharing-our-startups-hiring-manual-efe2094a180c"&gt;Hiring is a general problem&lt;/a&gt;. Friction is specific to the team, codebase, and pipeline in question. We &lt;a href="https://dev.to/developing-koan/porting-koans-150-000-line-javascript-codebase-to-typescript-b4818ccc42ac"&gt;adopted TypeScript&lt;/a&gt; to shorten feedback cycles (and &lt;a href="https://rjzaworski.com/2019/05/making-the-case-for-typescript"&gt;save ourselves runtime exceptions&lt;/a&gt; and &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;pagerduty incidents&lt;/a&gt;). That was an easy one.&lt;/p&gt;

&lt;p&gt;A less obvious bottleneck was how much time our pull requests were spending waiting for code review — on average, around 26 hours prior to merge. Three and a half business days. On &lt;em&gt;average&lt;/em&gt;. We were still deploying several times per day, but with several days’ worth of work-in-process backed up in the queue and plenty of context switching whenever it needed adjustment.&lt;/p&gt;
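&lt;p&gt;The metric itself is straightforward to compute. Here’s a minimal sketch, assuming an invented input shape rather than any particular API’s response format:&lt;/p&gt;

```typescript
type PullRequest = { openedAt: Date; mergedAt: Date };

// Average wall-clock time from "PR opened" to "PR merged", in hours:
// the core of a "Code Review Vitals"-style dashboard number.
// The PullRequest shape is an assumption for illustration.
function avgReviewHours(prs: PullRequest[]): number {
  if (prs.length === 0) return 0;
  const totalMs = prs.reduce(
    (sum, pr) => sum + (pr.mergedAt.getTime() - pr.openedAt.getTime()),
    0,
  );
  return totalMs / prs.length / (60 * 60 * 1000);
}
```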

&lt;p&gt;Here’s how review times tracked over time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QFvpxX2L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AHT9-h2VyhUxZSXmMYsbQKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QFvpxX2L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AHT9-h2VyhUxZSXmMYsbQKA.png" alt="Chart of code review time showing significant drop on March 2021" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart is fairly cyclical, with peaks and troughs corresponding roughly to the beginning and end of major releases — big, controversial changes as we’re trailblazing a new feature; smaller, almost-trivial punchlist items as we close in on release day. But the elephant in the series lands back around March 1st. That was the start of Q2, and the day we added “Code Review Vitals” to our dashboard.&lt;/p&gt;

&lt;p&gt;It’s been said that sunlight cures all ills, and simply measuring our workflow had the dual effects of revealing a significant bottleneck &lt;em&gt;and&lt;/em&gt; inspiring the behavioral changes needed to correct it.&lt;/p&gt;

&lt;p&gt;Voilà! More speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By the time you read this post, odds are that our CI/CD pipeline has already evolved forward from the state described above. Iteration applies as much to process as to the software itself. We’re still learning, and — just like with new features — the more we know, and the sooner we know it, the better off we’ll be.&lt;/p&gt;

&lt;p&gt;With that, a humble question: what have you learned from your own CI/CD practices? Are there checks that have worked (or totally flopped) that we should be incorporating ourselves?&lt;/p&gt;

&lt;p&gt;We’d love to hear from you!&lt;/p&gt;

</description>
      <category>ci</category>
      <category>continuousdelivery</category>
      <category>devops</category>
      <category>tdd</category>
    </item>
    <item>
      <title>Every Software Team Deserves a Charter</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Thu, 02 Dec 2021 16:26:59 +0000</pubDate>
      <link>https://forem.com/koan/every-software-team-deserves-a-charter-1dl0</link>
      <guid>https://forem.com/koan/every-software-team-deserves-a-charter-1dl0</guid>
      <description>&lt;p&gt;Software teams own features, projects, and services—everyone knows that. But caught up in what we’re doing, it’s easy to lose sight of why we’re doing it. Digging into the details might turn up some clues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why does our team exist? To maintain Service X&lt;/li&gt;
&lt;li&gt;Why does Service X need to run? Because it’s a tier-two service that’s a dependency of tier-one Services A and Y&lt;/li&gt;
&lt;li&gt;What happens if Service A goes down? I’m not sure, but it doesn’t sound good.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams exist for a reason, opaque though it may be, and team members should know what it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teams are works-in-progress
&lt;/h2&gt;

&lt;p&gt;Clarity is even harder to come by when a team is just starting out. Wouldn’t it be nice if teams arrived in the world as the ancient Greek poet Hesiod describes the birth of the goddess Athena?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And the father of men and gods gave her birth by way of his head…arrayed in arms of war. —Hesiod, Theogony&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anyone who has collaborated with other people can confirm that—unlike ancient Greek deities—teams do not spring forth fully-formed. Nor are they the product of a single head: teams are made up of individuals, and even with a cohesive vision of the team’s objective (not itself a given), everyone likely won’t agree on the best way to get there.&lt;/p&gt;

&lt;p&gt;The psychologist Bruce Tuckman framed this reality with a four-stage model for group development. In Tuckman’s model, teams form, storm, norm, and perform, with high-performing teams emerging only after weathering the turbulence of the early stages. But while we can’t control the sequence of events, we can hasten the journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerating team formation
&lt;/h2&gt;

&lt;p&gt;Shared expectations are the foundation underlying all effective teamwork. Yet often the process of establishing them is left up to chance. It’s true that time and good intentions usually lead to common ground, but by forcing explicit conversations about the team’s intentions and beliefs up front, a written charter can significantly accelerate the process.&lt;/p&gt;

&lt;p&gt;At a minimum, a charter should lay out the team’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mission statement&lt;/strong&gt;, summarizing the team’s shared purpose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;principles&lt;/strong&gt; for conduct and decision-making&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;key performance indicators&lt;/strong&gt; (KPIs) representing the team’s status and health&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While a team lead or manager may write the first draft, revisions are highly encouraged: input and feedback from all team members will only help ensure the charter represents a common understanding of the team’s identity. Ideally the charter-drafting process will start to build collective ownership as well.&lt;/p&gt;

&lt;p&gt;Let’s dig into specifics, and look at how we’ve addressed them in the &lt;a href="https://koan.co"&gt;Koan&lt;/a&gt; dev team charter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mission statement
&lt;/h2&gt;

&lt;p&gt;A charter’s mission statement is a single sentence summarizing what the team does, for whom, and how. Rather than serving specific customer personas or internal stakeholders, our development team is on the hook to advance the company and its mission as a whole. We do it by shipping reliable software and holding ourselves (and our colleagues) to high standards. As the mission statement at the top of our charter reads, we exist:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To advance Koan through engineering excellence and continuous improvement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Principles
&lt;/h2&gt;

&lt;p&gt;Principles are the guidelines that the team can fall back on when assessing contradictory or otherwise unclear choices. They also set expectations. A principle that we “win the marathon” both encourages thoughtful, long-term decision-making and implies that team members will do it.&lt;/p&gt;

&lt;p&gt;The team’s principles should be short and memorable. In creating our own charter, we brainstormed, debated, and revised our way down to just four. They’re both a clear expression of our common values and simple enough to remember. They read:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In support of our company Mission, Vision and Goals, and Values, Koan engineers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Figure it out. We find a way to deliver our objectives while continuously improving along the way.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ship to learn. We release the moment that staging is better than prod, listen early and often, and move faster because of it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deliver customer value. Our work directly benefits our customers — whether they’re outside Koan or at the next desk down.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Win the marathon. We’re in it for the long haul, making decisions that balance today’s needs against the uncertain future ahead.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once again, our closeness to the rest of the company shows through in a brief preamble connecting our team-specific principles back to the mission, vision, and values of the organization as a whole.&lt;/p&gt;

&lt;h2&gt;
  
  
  KPIs
&lt;/h2&gt;

&lt;p&gt;The team’s charter should include a measurable definition of its health. Is the team maintaining basic responsibilities and expectations? Team members should always be able to reference KPIs that quantify the team’s current status.&lt;/p&gt;

&lt;p&gt;As with the mission and principles, the specific metrics will vary considerably across functions. While a sales org may be looking at calls per rep or the total value of qualified leads, a dev team will often focus on the “—ilities”—stability, durability, and so on.&lt;/p&gt;

&lt;p&gt;Our own KPIs are split between numbers we’re interested in (but not actively losing sleep over) and numbers that really matter. The latter are important enough to take up precious real estate on our company dashboard, and as the lone development team in a dynamic startup we’ve limited our focus to just two themes with very specific measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: % TypeScript coverage (FE, BE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity&lt;/strong&gt;: PR lifetime (time delta from opened to merged)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are plenty of other numbers we’re interested in—but for the charter (2021 edition) those two were disproportionately more important to our continued improvement (and quality of life) as a team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The team evolves. The charter, too.
&lt;/h2&gt;

&lt;p&gt;Existing teams need charters, too, and chances are they aren’t the same as when the team was first formed. Explicit or otherwise, the charter will change. It &lt;em&gt;should&lt;/em&gt; change. Revisit it quarterly, revise it yearly, or whenever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A principle needs updating to reflect changing expectations or operating conditions&lt;/li&gt;
&lt;li&gt;A KPI is significantly exceeded, or becomes an automatic part of the culture&lt;/li&gt;
&lt;li&gt;Team members join or leave&lt;/li&gt;
&lt;li&gt;And so on!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, the charter is just the beginning. So much more goes into an effective team, from the skills individual team members bring to the goals they work together to achieve.&lt;/p&gt;

&lt;p&gt;Even as expectations change over the team’s lifetime, team members should never lack clarity on why the team exists, or on how they can show up and contribute!&lt;/p&gt;

</description>
      <category>teamwork</category>
      <category>leadership</category>
      <category>management</category>
      <category>teams</category>
    </item>
    <item>
      <title>Making the most of our startup’s on-call rotation</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Tue, 16 Nov 2021 00:19:39 +0000</pubDate>
      <link>https://forem.com/koan/making-the-most-of-our-startups-on-call-rotation-3mnh</link>
      <guid>https://forem.com/koan/making-the-most-of-our-startups-on-call-rotation-3mnh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Will I have to be on call?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the last hour of &lt;a href="https://koan.co/"&gt;Koan&lt;/a&gt;’s &lt;a href="https://www.koan.co/blog/why-we-open-sourced-our-hiring-manual"&gt;on-site interview&lt;/a&gt; we turn the tables and invite candidates to interview our hiring team. At face value it’s a chance to answer any open questions we haven’t answered earlier in the process. It’s also a subtle way to introspect on our own hiring process — after three rounds of interviews and side-channel conversations with the hiring manager, what have we missed? What’s on candidates’ minds? Can we address it earlier in the process?&lt;/p&gt;

&lt;p&gt;So, you asked, will I have to be on call?&lt;/p&gt;

&lt;p&gt;The middle of the night pager rings? The panicked investigations? Remediation, write-ups, post-mortems?&lt;br&gt;
We get it. We’ve been there. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/patrickc"&gt;Patrick Collison&lt;/a&gt;’s been there, too:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1432731774270906369-537" src="https://platform.twitter.com/embed/Tweet.html?id=1432731774270906369"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-1432731774270906369-537');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=1432731774270906369&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;“Don’t ruin the duck.” There are worse guiding principles for an on-call process (and operational health generally).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;So, will I have to be on call?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah, you will. But we’ve gotten a ton out of Koan’s on-call rotation and we hope you will, too. Ready to learn more?&lt;/p&gt;

&lt;h2&gt;
  
  
  On-call at Koan
&lt;/h2&gt;

&lt;p&gt;We set up Koan’s on-call rotation before we’d heard anything about Patrick’s ducks. Our version of “don’t ruin the duck” included three principles that (if somewhat less evocative) have held up surprisingly well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;We concentrate distractions&lt;/strong&gt; — our on-call developer is tasked with minimizing context switching for the rest of the team. We’ll escalate incidents if needed, but as much as possible the business of ingesting, diagnosing, and triaging issues in production services stays in a single person’s hands — keeping the rest of the team focused on shipping great product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We control our own destiny&lt;/strong&gt; — just like Koan’s culture at large, being on-call is much more about results (uptime, resolution time, pipeline throughput, and learning along the way) than how they come about. Our on-call developer wields considerable authority over how issues are fielded and dispatched, and even over the production release schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We take turns&lt;/strong&gt; — on-call responsibilities rotate weekly. This keeps everyone engaged with the on-call process and avoids condemning any single person to an eternity (or even an extended period) of pager duty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These principles have helped us wrangle a fundamentally interrupt-driven process. What we didn’t realize, though, was how much time — and eventually, value — we were recovering between the fire drills.&lt;/p&gt;

&lt;h2&gt;
  
  
  How bugs begin
&lt;/h2&gt;

&lt;p&gt;Before that, though, we’d be remiss to skip the easiest path to a calm, quiet on-call schedule: don’t release. To paraphrase Descartes, code ergo bugs — no matter how diligent you are in QA, shipping software means injecting change (and therefore new defects) into your production environment.&lt;/p&gt;

&lt;p&gt;Not shipping isn’t an option. We’re in the habit of releasing multiple times per day, not to mention all of the intermediate builds pushed to our staging environment via CI/CD. A production issue every now and then is a sign that the system’s healthy; that we’re staying ambitious and shipping fast.&lt;/p&gt;

&lt;p&gt;But it also means that things sometimes break. And when they do, someone has to pick up the phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals
&lt;/h2&gt;

&lt;p&gt;On the bad days, on-call duty is a steady stream of interruptions punctuated by the occasional crisis. On the good days it isn’t much to write home about. Every day, though, there are at least a few minutes to tighten down screws, solve problems, and explore the system’s nooks and crannies. This is an intentional feature (not a bug) of our on-call rotation, and the payoff has been huge. We’ve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;built shared ownership of the codebase and production systems&lt;/li&gt;
&lt;li&gt;systematized logging, metrics, monitoring, and alerting&lt;/li&gt;
&lt;li&gt;built empathy for customers (and our support processes)&lt;/li&gt;
&lt;li&gt;spread awareness of little-used features (we’re always onboarding)&lt;/li&gt;
&lt;li&gt;iterated on key processes (ingestion/triage, release management, etc)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t get all that by just passing around a firefighting hat. You need buy-in and — crucially — a healthy relationship with your production environment. Which brings us back to our principles, and the on-call process that enables it.&lt;/p&gt;

&lt;h2&gt;
  
  
  We concentrate distractions
&lt;/h2&gt;

&lt;p&gt;When something breaks, the on-call schedule clarifies who’s responsible for seeing it’s fixed. As the proverbial umbrella keeping everyone else focused and out of the rain (sometimes a downpour, sometimes a drizzle), you don’t need to immediately fix every problem you see: just to investigate, file, and occasionally prioritize them for immediate attention.&lt;/p&gt;

&lt;p&gt;That still means a great deal of on-call time spent ingesting and triaging a steady drip of symptoms from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customer issues escalated by our customer success team&lt;/li&gt;
&lt;li&gt;internal bug reports casually mentioned in conversations, slack channels, or email threads&lt;/li&gt;
&lt;li&gt;exceptions/alerts reported by application and infrastructure monitoring tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes symptoms aren’t just symptoms, and there’s a real issue underneath. Before you know it, the pager starts ringing—&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter the pager
&lt;/h3&gt;

&lt;p&gt;The water’s getting warmer. A pager ping isn’t the end of the world, but we’ve tuned out enough false positives that an alert is a good sign that something bad is afoot.&lt;/p&gt;

&lt;p&gt;Once you’ve confirmed a real issue, the next step is to classify its severity and impact. A widespread outage? Those need attention immediately. Degraded performance in a specific geography? Not awesome, but something that can probably wait until morning. Whatever it is, we’re looking to you to coordinate our response, both externally (updating our status page) and either escalating or resolving the issue yourself.&lt;/p&gt;
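&lt;p&gt;As a rough sketch of that triage logic (the severity labels and response policies are illustrative, not a literal runbook):&lt;/p&gt;

```typescript
type Severity = 'outage' | 'degraded' | 'cosmetic';

// Map a confirmed issue to an initial on-call response, following the
// rough triage rules above. Labels and policies are illustrative.
function initialResponse(severity: Severity, businessHours: boolean): string {
  switch (severity) {
    case 'outage':
      return 'respond immediately and update the status page';
    case 'degraded':
      return businessHours ? 'investigate today' : 'defer to morning';
    case 'cosmetic':
      return 'file a ticket for triage';
  }
}
```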

&lt;p&gt;On-call isn’t a private island. There will always be times we need to pause work in progress, call in the team, and get to the bottom of something that’s keeping us down. But the goal is to do it in a controlled fashion, holding as much space for everyone else as you reasonably can.&lt;/p&gt;

&lt;h2&gt;
  
  
  We control our own destiny
&lt;/h2&gt;

&lt;p&gt;Your responsibilities aren’t purely reactive, however. Controlling your own destiny means having at least a little agency over what breaks and when. This isn’t just wishful thinking. While issues introduced in the past are always a lurking threat — logical edge cases, bottlenecks, resource limits, and so on — the source of most new issues is a new release.&lt;/p&gt;

&lt;p&gt;It makes sense, then, for whoever’s on-call to have the last word on when (and how) new releases are shipped. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;managing the release&lt;/strong&gt; — generating changelogs, reviewing the contents of the release, and ensuring the appropriate people are warned and signatures are obtained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;debugging release / deployment issues&lt;/strong&gt; — monitoring both the deployment and its immediate aftermath, and remediating any issues that arise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;making the call on hotfix releases and rollbacks&lt;/strong&gt; — as a step sideways from our usual flow they’re not tools we use often. But they’re there (and very quick) if you need them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Closing the feedback loop
&lt;/h3&gt;

&lt;p&gt;An unexpected benefit we’ve noticed from coupling on-call and release management duties is the backpressure it puts on both our release cadence and deployment pipeline. If we’re underwater with issues from the previous release, the release manager has strong incentives to see they’re fixed before shipping anything else. Ditto any issues in our CI/CD processes.&lt;/p&gt;

&lt;p&gt;Neither comes up too often, fortunately, and while we can’t totally write off the combination of robust systems and generally good luck, it’s just as hard to discount the benefits of tight feedback and an empowered team.&lt;/p&gt;

&lt;h2&gt;
  
  
  We take turns
&lt;/h2&gt;

&lt;p&gt;But you said, “team!” — a lovely segue to that last principle. Rotating on-call responsibility helps underscore our team’s commitment to leaving a relatively clean bill (releases shipped, exceptions handled; tickets closed; etc) for the next person up. When you’re on-call, you’re the single person best placed to deflect issues that would otherwise engulf the entire team. When you’re about to be on call, you’re invested in supporting everyone else in doing the same. You’d love to start your shift with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;healthy systems&lt;/li&gt;
&lt;li&gt;a manageable backlog of support inquiries&lt;/li&gt;
&lt;li&gt;a clear list of production exceptions&lt;/li&gt;
&lt;li&gt;a quick brain-dump of issues fielded (and ongoing concerns) from the teammate you’re taking over from&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A frequent rotation almost guarantees that everybody’s recently felt the same way. Team members regularly swap shifts (for vacations, appointments, weddings, anniversaries, or any other reason), but it’s never long before you’re back on call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rest of the time
&lt;/h2&gt;

&lt;p&gt;Ultimately, we’ve arrived at an on-call process that balances the realities of running software in production with a high degree of agency. We didn’t explicitly prioritize quality of life, and we don’t explicitly track how much time on-call duties are eating up. But collective ownership, individual buy-in, and tight feedback have pushed the former up and the latter down, to the point where you’ll find you have considerable time left over for other things. Ideally you’ll use your turn on-call to dig deeper into the issues you touch along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exploring unfamiliar features (with or without reported bugs)&lt;/li&gt;
&lt;li&gt;tightening up our CI processes&lt;/li&gt;
&lt;li&gt;tuning configurations&lt;/li&gt;
&lt;li&gt;writing regression tests&lt;/li&gt;
&lt;li&gt;improving logging and observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes, you’ll be triaging issues, squashing bugs, and maybe even putting out the odd production fire. You can almost count on having time left to help minimize the need for on-call. You’re on the hook to fix things if they break — and empowered to make them better.&lt;/p&gt;

&lt;p&gt;So yes, you’ll have to take an on-call shift.&lt;/p&gt;

&lt;p&gt;Help us make it a good one!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://unsplash.com/@danielsessler?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Daniel Seßler&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/duck-decoy?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>errors</category>
      <category>ducks</category>
    </item>
    <item>
      <title>From Pebbles to Brickworks: a Story of Cloud Infrastructure Evolved</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Tue, 18 Aug 2020 18:16:23 +0000</pubDate>
      <link>https://forem.com/koan/from-pebbles-to-brickworks-a-story-of-cloud-infrastructure-evolved-3p63</link>
      <guid>https://forem.com/koan/from-pebbles-to-brickworks-a-story-of-cloud-infrastructure-evolved-3p63</guid>
      <description>&lt;p&gt;You can build things out of pebbles. Working with so many unique pieces isn’t easy, but if you slather them with mortar and fit them together just so, it’s possible to build a house that won’t tumble down in the slightest breeze.&lt;/p&gt;

&lt;p&gt;Like many startups, that’s where &lt;a href="https://www.koan.co/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=terraform-1"&gt;Koan’s&lt;/a&gt; infrastructure started. With lovingly hand-rolled EC2 instances sitting behind lovingly hand-rolled ELBs inside a lovingly — yes — hand-rolled VPC. Each came with its own quirks, software updates, and Linux version. Maintenance was a constant test of our technical acumen and patience (not to mention nerves); scalability was out of the question.&lt;/p&gt;

&lt;p&gt;These pebbles carried us from our earliest prototypes to the first public iteration of Koan’s leadership platform. But there comes a day in every startup’s journey when its infrastructure needs to grow up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivations
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The wolf chased them down the lane and he almost caught them. But they made it to the brick house and slammed the door closed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What we wanted were bricks, uniform commodities that can be replicated or replaced at will. Infrastructure &lt;a href="https://12factor.net/"&gt;built from bricks&lt;/a&gt; has some significant advantages over our pebbly roots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt;. Knowing who did what (and when) makes it possible to understand and collaborate on infrastructure. It’s also an absolute must for compliance. Repeatable, version-controlled infrastructure supplements application changelogs with a snapshot of the underlying infrastructure itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt;. Not knowing — at least, not really knowing — your infrastructure makes every change a nervous one. For our part, we didn’t really know ours. Which isn’t a great position to be in when that infrastructure needs to scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;. Pebbles come in all shapes and sizes. New environment variables, port allocations, permissions, directory structure, and dependencies must be individually applied and verified on each instance. This consumes development time and increases the risk of “friendly-fire” incidents from any inconsistencies between different hosts (see: #2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeatability&lt;/strong&gt;. Rebuilding a pebble means replicating all of the natural forces that shaped it over the eons. Restoring our infrastructure after a catastrophic failure seemed like an impossible task—a suspicion that we weren’t in a hurry to verify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;. Replacing and extending are two sides of the same coin. While it’s possible to snap a machine image and scale it out indefinitely, an eye to upkeep and our own mental health encouraged us to consider a fresh start. From a minimal, reasonably hardened base image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since our work at Koan is all about goal achievement, most of our technical projects start exactly where you’d expect. Here: reproducible infrastructure (or something closer to it), documented and versioned as code. We had plenty of expertise with tools like &lt;a href="https://www.terraform.io/"&gt;terraform&lt;/a&gt; and &lt;a href="https://www.ansible.com/"&gt;ansible&lt;/a&gt; to draw on and felt reasonably confident putting them to use—but even with familiar tooling, our initially shaky foundation didn’t exactly discourage caution.&lt;/p&gt;

&lt;p&gt;That meant taking things step by gradual step, establishing and socializing patterns that we intended to eventually adopt across all of our cloud infrastructure. That’s a story for future posts, but the journey had to start somewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev today, tomorrow the world
&lt;/h2&gt;

&lt;p&gt;“Somewhere,” was our trusty CI environment, &lt;code&gt;dev&lt;/code&gt;. Frequent, thoroughly-tested releases are both a reasonable expectation and a point of professional pride for our development team. &lt;code&gt;dev&lt;/code&gt; is where the QA magic happens, and since downtime on &lt;code&gt;dev&lt;/code&gt; blocks review, we needed to keep disruptions to a minimum.&lt;/p&gt;

&lt;p&gt;Before &lt;code&gt;dev&lt;/code&gt; could assume its new form, we needed to be reasonably confident that we could rebuild it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;…in the right VPC&lt;/li&gt;
&lt;li&gt;…with the right Security Groups assigned&lt;/li&gt;
&lt;li&gt;…with our standard logging and monitoring&lt;/li&gt;
&lt;li&gt;…and provisioned with a working instance of the Koan platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four little tests, and we’d have both a repeatable &lt;code&gt;dev&lt;/code&gt; environment and a template we could extend out to production.&lt;/p&gt;

&lt;p&gt;We planned to tackle &lt;code&gt;dev&lt;/code&gt; in two steps. First, we would document (and eventually rebuild) our AWS infrastructure using &lt;code&gt;terraform&lt;/code&gt;. Once we had a reasonably-plausible configuration on our hands, we would then use &lt;code&gt;ansible&lt;/code&gt; to deploy the Koan platform. The two-step approach deferred a longer-term dream of fully-immutable resources, but it allowed us to address one big challenge (the infrastructure) while leaving our existing deployment processes largely intact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replacing infrastructure with Terraform
&lt;/h2&gt;

&lt;p&gt;First, the infrastructure. The formula for documenting existing infrastructure in &lt;code&gt;terraform&lt;/code&gt; goes something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a stub entry for an existing resource&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;terraform import&lt;/code&gt; to attach the stub to the existing infrastructure&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;terraform state&lt;/code&gt; and/or &lt;code&gt;terraform plan&lt;/code&gt; to reconcile inconsistencies between the stub and reality&lt;/li&gt;
&lt;li&gt;Repeat until all resources are documented&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how we documented the &lt;code&gt;dev&lt;/code&gt; VPC's default security group, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo '
resource "aws_default_security_group" "default" {
  # reference (via a variable) to a resource not yet represented in our
  # Terraform configuration
  vpc_id = var.vpc_id
}' &amp;gt;&amp;gt; main.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we could import the existing security group and run &lt;code&gt;terraform plan&lt;/code&gt; to see the difference between the existing infrastructure and our Terraform config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform import aws_default_security_group.default sg-123456
$ terraform plan
# module.dev-appserver.aws_default_security_group.default will be updated in-place
  ~ resource "aws_default_security_group" "default" {
      ~ egress                 = [
          - {                                 
              - cidr_blocks      = [
                  - "0.0.0.0/0",
                ]                    
              - description      = ""
              - from_port        = 0
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "-1"
              - security_groups  = []
              - self             = false
              - to_port          = 0
            },
        ]
        id                     = "sg-123456"
    # ...
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the diff as an outline, we could then fill in the corresponding &lt;code&gt;aws_default_security_group.default&lt;/code&gt; entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf
resource "aws_default_security_group" "default" {
  vpc_id = var.vpc_id
  ingress {
    protocol  = -1
    self      = true
    from_port = 0
    to_port   = 0
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-running &lt;code&gt;terraform plan&lt;/code&gt;, we could verify that the updated configuration matched the existing resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform plan
...
No changes. Infrastructure is up-to-date.
This means that Terraform did not detect any differences between
your configuration and real physical resources that exist. As a 
result, no actions need to be performed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The keen observer will recognize a prosaic formula crying out for automation, a call we soon answered. But for our first, cautious steps, it was helpful to document resources by hand. We wrote the configurations, parameterized resources that weren’t imported yet, and double-checked (triple-checked) our growing Terraform configuration against the infrastructure reported by the &lt;a href="https://aws.amazon.com/cli/"&gt;aws CLI&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sharing Terraform state with a small team
&lt;/h2&gt;

&lt;p&gt;By default, Terraform tracks the state of managed infrastructure in a local &lt;a href="https://www.terraform.io/docs/state/index.html"&gt;tfstate&lt;/a&gt; file. This file contains both configuration details and a mapping back to the “live” resources (via IDs, resource names, and in Amazon’s case, ARNs) in the corresponding cloud provider. As a small, communicative team in a hurry, we felt comfortable bucking best practices and checking our state file right into source control. In almost no time we ran into collisions across git branches—a shadow of collaboration and locking problems to come—but we resolved to adopt more team-friendly practices soon. For now, we were up and running.&lt;/p&gt;
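&lt;p&gt;For the record, the usual team-friendly fix is a remote backend with state locking. A sketch of the common S3-plus-DynamoDB setup (the bucket, key, and table names here are hypothetical, not our actual configuration) looks something like:&lt;/p&gt;

```hcl
# backend.tf -- store Terraform state remotely and lock it during writes
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # hypothetical bucket name
    key            = "dev/terraform.tfstate"    # one state file per environment
    region         = "us-west-2"
    dynamodb_table = "terraform-locks"          # DynamoDB table used for locking
  }
}
```

&lt;p&gt;With a backend like this in place, &lt;code&gt;terraform init&lt;/code&gt; migrates the local state file, and concurrent runs wait on the lock instead of clobbering each other.&lt;/p&gt;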

&lt;p&gt;&lt;a href="https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast"&gt;Make it work, make it right&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Provisioning an application with Ansible
&lt;/h2&gt;

&lt;p&gt;With most of our &lt;code&gt;dev&lt;/code&gt; infrastructure documented in Terraform, we were ready to fill it out. At this stage our attention shifted from the infrastructure itself to the applications that would be running on it—namely, the Koan platform.&lt;/p&gt;

&lt;p&gt;Koan’s platform deploys as a monolithic bundle containing our business logic, interfaces, and the small menagerie of dependent services that consume them. Which services run on a given EC2 instance will vary from one to the next. Depending on its configuration, a production node might be running our REST and GraphQL APIs, webhook servers, task processors, any of a variety of cron jobs, or all of the above.&lt;/p&gt;

&lt;p&gt;As a smaller, lighter facsimile, &lt;code&gt;dev&lt;/code&gt; has no such differentiation. Its single, inward-facing node plays host to the whole kitchen sink. To simplify testing (and minimize the damage to &lt;code&gt;dev&lt;/code&gt;), we took the cautious step of replicating this configuration in a representative local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a local Amazon Linux environment
&lt;/h2&gt;

&lt;p&gt;Reproducing cloud services locally is tricky. We can’t run EC2 on a developer’s laptop, but Amazon has helpfully shipped &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-2-virtual-machine.html"&gt;images of Amazon Linux&lt;/a&gt;—our bricks’ target distribution. With a little bit of fiddling and a lot of help from &lt;a href="https://cloudinit.readthedocs.io/"&gt;&lt;code&gt;cloud-init&lt;/code&gt;&lt;/a&gt;, we managed to bring up reasonably representative Amazon Linux instances inside a local &lt;a href="https://www.virtualbox.org/"&gt;VirtualBox&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh -i local/ssh/id_rsa dev@localhost -p2222
Last login: Fri Sep 20 20:07:30 2019 from 10.0.2.2
       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\\___|___|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
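&lt;p&gt;The “fiddling” largely amounts to feeding the VM a seed disk that &lt;code&gt;cloud-init&lt;/code&gt; reads on first boot. A minimal &lt;code&gt;user-data&lt;/code&gt; file (the user name and key below are illustrative) is enough to create a login user:&lt;/p&gt;

```yaml
#cloud-config
# user-data: consumed by cloud-init on the instance's first boot
users:
  - name: dev                      # illustrative login user
    sudo: ALL=(ALL) NOPASSWD:ALL   # passwordless sudo for provisioning
    ssh_authorized_keys:
      - ssh-rsa AAAA...example dev@local
```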



&lt;p&gt;At this point, we could create an ansible inventory assigning the same groups to our "local" environment that we would eventually assign to &lt;code&gt;dev&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# local/inventory.yml
appservers:
  hosts:
    127.0.0.1:
      ansible_port: 2222
cron:
  hosts:
    127.0.0.1:
      ansible_port: 2222
# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we did it all over again, we could likely save some time by skipping VirtualBox in favor of a detached EC2 instance. Then again, having a local, fast, safe environment to test against has already saved time in developing new ansible playbooks. The jury’s still out on that one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible up!
&lt;/h2&gt;

&lt;p&gt;With a reasonable facsimile of our “live” environment, we were finally down to the application layer. ansible approaches hosts in terms of their roles—databases, webservers, or something else entirely. We approached this by separating out two “base” roles for our VMs generally (&lt;code&gt;common&lt;/code&gt;) and our app servers in particular (&lt;code&gt;backend&lt;/code&gt;), where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;common&lt;/code&gt; role described monitoring, the runtime environment, and a default directory structure and permissions&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;backend&lt;/code&gt; role added a (versioned) release of the Koan platform&lt;/li&gt;
&lt;/ul&gt;
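&lt;p&gt;In ansible terms, each role is just a directory of tasks, templates, and defaults. A sketch of what the &lt;code&gt;common&lt;/code&gt; role’s entry point might contain (the paths and package names are illustrative, not our actual playbook):&lt;/p&gt;

```yaml
# roles/common/tasks/main.yml -- illustrative sketch of the "common" role
- name: Create the application directory
  file:
    path: /opt/app
    state: directory
    owner: dev
    mode: "0755"

- name: Install the monitoring agent
  yum:
    name: amazon-cloudwatch-agent
    state: present
```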

&lt;p&gt;Additional roles layered on top represent each of our minimally-dependent services — &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;tasks&lt;/code&gt;, &lt;code&gt;cron&lt;/code&gt;, and so on—which we then assigned to the local host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# appservers.yml 
- hosts: all
  roles:
  - common
  - backend
- hosts: appservers
  roles:
  - api
- hosts: cron
  roles:
  - cron
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We couldn’t bring EC2 out of the cloud, but bringing up a local instance that quacked a &lt;em&gt;lot&lt;/em&gt; like EC2 was now as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ansible-playbook \
  --user=dev \
  --private-key ./local/ssh/id_rsa \
  --inventory local/inventory.yml \
  appservers.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  From pebbles to brickwork
&lt;/h2&gt;

&lt;p&gt;With our infrastructure in &lt;code&gt;terraform&lt;/code&gt;, our deployment in &lt;code&gt;ansible&lt;/code&gt;, and all of the confidence that local testing could buy, we were ready to start making bricks. The plan (and there’s always a plan!) was straightforward enough:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;terraform apply&lt;/code&gt; to create a new &lt;code&gt;dev&lt;/code&gt; instance&lt;/li&gt;
&lt;li&gt;Add the new host to our &lt;code&gt;ansible&lt;/code&gt; inventory and provision it&lt;/li&gt;
&lt;li&gt;Add it to the &lt;code&gt;dev&lt;/code&gt; ELB and wait for it to join (assuming provisioning succeeded and health checks passed)&lt;/li&gt;
&lt;li&gt;Verify its behavior and make adjustments as needed&lt;/li&gt;
&lt;li&gt;Remove the old &lt;code&gt;dev&lt;/code&gt; instance (our pebble!) from &lt;code&gt;terraform&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Rinse and repeat in production&lt;/li&gt;
&lt;/ol&gt;
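&lt;p&gt;The first two steps boil down to a couple of commands (the &lt;code&gt;dev&lt;/code&gt; inventory file name is illustrative):&lt;/p&gt;

```
$ terraform apply                  # step 1: create the new dev instance
$ ansible-playbook \
    --inventory dev/inventory.yml \
    appservers.yml                 # step 2: provision it
```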

&lt;p&gt;The entire process was more hands-on than anyone really wanted, but given the indeterminate state of our existing infrastructure and our guiding philosophy, step one was simply waving &lt;code&gt;dev&lt;/code&gt; out the door.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make it work, make it right.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Off it went! With only a little back and forth to sort out previously unnoticed details, our new &lt;code&gt;dev&lt;/code&gt; host took its place as brick #1 in Koan’s growing construction. We extracted the &lt;code&gt;dev&lt;/code&gt; configuration into a reusable &lt;code&gt;terraform&lt;/code&gt; module and by the end of the week our brickwork stretched all the way out to production.&lt;/p&gt;

&lt;p&gt;In our next post, we'll dive deeper into how we imported volumes of undocumented infrastructure into Terraform.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Big thanks to &lt;a href="https://twitter.com/swac"&gt;Ashwin Bhat&lt;/a&gt; for early feedback, &lt;a href="https://twitter.com/randallagordon?lang=en"&gt;Randall Gordon&lt;/a&gt; and &lt;a href="https://twitter.com/andrewbeers?lang=en"&gt;Andy Beers&lt;/a&gt; for helping turn the &lt;a href="https://devops.stackexchange.com/questions/653/what-is-the-definition-of-cattle-not-pets"&gt;pets/cattle metaphor&lt;/a&gt; into something more humane, and &lt;a href="https://unsplash.com/@emardi?utm_source=dev-to&amp;amp;utm_medium=referral"&gt;EMAR DI&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt; for the cover image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And if you’re into building software to help every team achieve its objectives, &lt;a href="https://www.koan.co/company/careers?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=terraform-1"&gt;Koan is hiring&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
      <category>ansible</category>
    </item>
    <item>
      <title>High-Value Software Testing</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Sat, 11 Jul 2020 16:37:44 +0000</pubDate>
      <link>https://forem.com/rjz/high-value-software-testing-31dc</link>
      <guid>https://forem.com/rjz/high-value-software-testing-31dc</guid>
      <description>&lt;p&gt;Tests help developers eliminate defects, build confidence, practice good design, and ideally all three. They also take time to write, run, and update--time that's no longer available for other development tasks.&lt;/p&gt;

&lt;p&gt;High-value testing seeks to maximize the return on that investment. Like much of software development, it's as much art as science. But a few practical principles can help keep things pointed in the right direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The golden rule of high-value testing
&lt;/h2&gt;

&lt;p&gt;Remember: the aim of software testing—the most fundamental reason to bother—is to &lt;em&gt;secure the value the software creates&lt;/em&gt;. If a test doesn't secure value, it's better having no test at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test the right things
&lt;/h2&gt;

&lt;p&gt;The relative benefit of additional testing tapers off as coverage nears 100%. In a safety-critical application the conversation may &lt;em&gt;begin&lt;/em&gt; at 100%--but in less-risky settings, there are pragmatic arguments against such exhaustive testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test what you must (and only what you must)
&lt;/h3&gt;

&lt;p&gt;The familiar challenge of balancing quality, scope, and time (pick two) applies to testing, too. As low-quality tests are worse than useless and software has &lt;em&gt;no&lt;/em&gt; value until it ships, the compromise usually comes down to scope.&lt;/p&gt;

&lt;p&gt;If there’s only budget for a single test, the highest return will come from verifying that the positive, “happy path” is working as intended. From there, tests can expand to cover the negative cases as risk requires and budget allows.&lt;/p&gt;
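&lt;p&gt;Concretely (a toy Python example, with a function invented purely for illustration), the happy-path test lands first and the negative cases queue up behind it:&lt;/p&gt;

```python
import unittest

def divide(a, b):
    """Toy subject under test."""
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b

class TestDivide(unittest.TestCase):
    # The single highest-value test: the positive, "happy path" case.
    def test_divides_two_numbers(self):
        self.assertEqual(divide(6, 3), 2)

    # Added later, as risk requires and budget allows.
    def test_rejects_a_zero_divisor(self):
        with self.assertRaises(ValueError):
            divide(6, 0)
```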

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do tests inspire an appropriate level of confidence in our application?&lt;/li&gt;
&lt;li&gt;Is testing slowing upfront development?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Focus on the interface
&lt;/h3&gt;

&lt;p&gt;All the outside world knows about a function, class, or module is its public interface. Tests should begin and end there.&lt;/p&gt;

&lt;p&gt;As software evolves, its internal implementation should be free to follow. Testing takes time. Testing details that don’t impact outward behavior is inefficient; worse, it can confuse and discourage future changes. Internal logic that can be extracted into a generally-useful unit may deserve a test of its own—but if not, and if the interface above it is working as intended, don’t lose sleep over what’s happening beneath.&lt;/p&gt;

&lt;p&gt;Testing is also an opportunity both to &lt;a href="https://dev.to/2018/04/five-better-comments"&gt;show how the interface is used&lt;/a&gt; and to actually &lt;em&gt;use&lt;/em&gt; it. Every test is an extra chance to smooth out a rough edge before it can snag a customer. Afterwards, it’s &lt;a href="https://dev.to/2015/05/the-documentation-youve-been-looking-for"&gt;living documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do tests reference internal methods or configuration settings that wouldn’t be seen in normal operations?&lt;/li&gt;
&lt;li&gt;Do tests reflect how the interface should be used?&lt;/li&gt;
&lt;li&gt;Do tests document both the interface’s use cases and error states?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test in layers
&lt;/h3&gt;

&lt;p&gt;No matter the slope of the &lt;a href="https://martinfowler.com/bliki/TestPyramid.html"&gt;testing pyramid&lt;/a&gt;, separating the zippy, independent tests that verify local behavior from the hefty specs securing the assembled system will improve the responsiveness and value of both. The goal is a reliable test suite, yes, but also one that encourages a fast feedback cycle. Nobody wants to wait through a treacle-slow end-to-end run just to confirm that a unit test passes, nor should that end-to-end run spend much time repeating unit test assertions.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it easy to delineate between lightweight unit tests and more-expensive functional or end-to-end tests?&lt;/li&gt;
&lt;li&gt;Are functional or end-to-end tests repeating details that should be captured by unit tests?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trust proven dependencies
&lt;/h3&gt;

&lt;p&gt;Any open-source library worth its salt will ship with its own community-maintained test suite. It shouldn’t require additional testing in userland. Trusting the community’s experience (and contributing patches upstream if that faith proves misplaced) means more time to test business logic and raise the value of the test suite as a whole.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are dependencies in widespread use by the community?&lt;/li&gt;
&lt;li&gt;Are tests spending time re-asserting “known-good” behavior in dependencies?&lt;/li&gt;
&lt;li&gt;Can gaps in a dependency’s test suite be contributed upstream?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mock what you must (but nothing more)
&lt;/h3&gt;

&lt;p&gt;Most of the parts within a reasonably complex software application are both dependent on and dependencies of other parts of the same system. It’s common for tests to supply mocks—test-specific implementations of external interfaces—to help isolate unit behavior. But see the problem? If the "real" implementation changes, an out-of-date mock will still allow tests to pass. Even a correct mock represents new code to write, verify, and maintain--who will be testing that?&lt;/p&gt;

&lt;p&gt;If a mock isn't needed, don’t write it. Don’t be shy about refactoring to &lt;em&gt;make&lt;/em&gt; it unnecessary, either. Mocks often crop up when logic and I/O are tightly coupled; in many cases decoupling them will lead to more reusable design (and simpler testing) than supplying internal behavior via a test mock.&lt;/p&gt;
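&lt;p&gt;A tiny Python sketch of that refactoring (the names are invented for illustration): once the logic is pure, it needs no mock at all, and only the thin I/O wrapper is left to cover elsewhere:&lt;/p&gt;

```python
def order_total(orders):
    """Pure logic: no I/O, trivially testable without a mock."""
    return sum(order["amount"] for order in orders)

def order_total_from_db(conn):
    """Thin I/O wrapper; the only part that would ever need a mock."""
    return order_total(conn.fetch_orders())
```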

&lt;p&gt;If a mock &lt;em&gt;is&lt;/em&gt; needed, though:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep it simple (most testing utilities provide canonical tools to help with this).&lt;/li&gt;
&lt;li&gt;Keep it clean (don’t use mocks to store state, and don’t share mocks between tests).&lt;/li&gt;
&lt;/ol&gt;
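&lt;p&gt;In Python, for instance, the canonical tool is &lt;code&gt;unittest.mock&lt;/code&gt;. A simple, stateless mock scoped to a single test (the client interface here is invented for illustration) might look like:&lt;/p&gt;

```python
from unittest import mock

def fetch_greeting(client):
    """Code under test: depends only on the client's public interface."""
    return client.get("/greeting").upper()

def test_fetch_greeting():
    # A fresh, single-purpose mock: no stored state, not shared between tests.
    fake_client = mock.Mock()
    fake_client.get.return_value = "hello"
    assert fetch_greeting(fake_client) == "HELLO"
    fake_client.get.assert_called_once_with("/greeting")
```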

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can logic be tested independent of external behavior?&lt;/li&gt;
&lt;li&gt;Are mocks implemented using tools provided by a test utility or the language’s standard library?&lt;/li&gt;
&lt;li&gt;What assumptions are represented (and upheld) by each mock?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Inspire confidence
&lt;/h2&gt;

&lt;p&gt;Tests that can't be trusted are worse than no tests at all. Energy invested in their reliability and their fidelity to reality will always, always pay off.&lt;/p&gt;

&lt;h3&gt;
  
  
  Different run, same result
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;It works on my machine&lt;br&gt;
-- anonymous&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A test suite should reach the same conclusions no matter who’s running it. For unit tests this means avoiding dependencies on network, I/O, the local timezone, and other tests. Deeper, end-to-end tests should proceed from a consistent, isolated state—shared datastores or service dependencies are a recipe for heartache when multiple test runs are proceeding in parallel in a continuous integration environment.&lt;/p&gt;

&lt;p&gt;Tests should isolate a single variable, poke it, and see how it responds. If other things are poking it at the same time, it takes some seriously fancy filters and statistical magic--or more likely a sad, sad playthrough of the run-it-until-it-passes game--to get back to the way things were working before.&lt;/p&gt;
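&lt;p&gt;A common fix is to make time (or any other ambient input) an explicit parameter. In Python, for example (the function is invented for illustration):&lt;/p&gt;

```python
from datetime import datetime, timezone

def log_filename(prefix, now=None):
    """Deterministic when the caller supplies `now`; production code
    simply omits it and gets the real clock."""
    now = now or datetime.now(timezone.utc)
    return "{}-{}.log".format(prefix, now.strftime("%Y%m%d"))
```

&lt;p&gt;Tests pin &lt;code&gt;now&lt;/code&gt; to a fixed instant and get the same answer in every timezone.&lt;/p&gt;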

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do tests interact with a database, filesystem, the network interface, or perform any other (potentially-fraught) I/O?&lt;/li&gt;
&lt;li&gt;Do tests depend on or otherwise interact with other tests?&lt;/li&gt;
&lt;li&gt;Will a time-dependent test produce the same output when run in another timezone?&lt;/li&gt;
&lt;li&gt;Will the same test pass if it’s run on another operating system / filesystem?&lt;/li&gt;
&lt;li&gt;Does logic depend on random number-generators or any other non-deterministic state?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Non-deterministic tests clog development workflows and breed mistrust in the test suite generally. Fix them promptly. Better yet, avoid writing them in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test reality
&lt;/h3&gt;

&lt;p&gt;Production is the one place in the Universe that really, truly quacks like production. There’s something to be said for testing there, particularly around performance-related issues, but it’s also a very expensive place to get things wrong.&lt;/p&gt;

&lt;p&gt;Given the hefty operational risk of testing in production, it’s usually best to substitute a reasonable facsimile instead. The application configuration and operating environment should be as “prod-like” as possible, even if system parameters are beefed up or the underlying dataset is slimmed down in the name of developer experience.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where does the test environment deviate from production?&lt;/li&gt;
&lt;li&gt;What production-specific issues aren’t (or can’t be) adequately covered in the test suite?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Human factors
&lt;/h2&gt;

&lt;p&gt;A comprehensive test suite is a worthy aspiration with diminishing returns. Squeezing the most from it requires a careful balance between the value it secures and its upkeep cost. A good test suite isn’t merely complete: it’s also relatively straightforward to modify or run. Tests that are easy to write, verify, and adjust encourage quick iteration and deliver more value. And tests that no-one wants to touch…&lt;/p&gt;

&lt;p&gt;One lens for assessing upkeep is the time needed to modify an existing test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explain failures
&lt;/h3&gt;

&lt;p&gt;If an assertion fails, it shouldn’t take an expedition to explain where or why. An easy first step towards clear testing (itself an art form) is to make sure the runner or test description highlights the test’s subject and motivation.&lt;/p&gt;

&lt;p&gt;For unit tests, identify the test’s subject and what the test is meant to prove. This might be a module, class, or method—it might even be assigned by convention in the test harness. Next, explain the motivation. What is the test out to achieve?&lt;/p&gt;

&lt;p&gt;For higher-level integration or end-to-end testing, it may be more appropriate to describe tests in terms of the actors involved and their progress through a predefined test script. Again, any failures should make it clear what the actor was trying to accomplish and where their expectations weren’t met.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do failed tests clearly indicate what was being tested?&lt;/li&gt;
&lt;li&gt;Do failed tests clearly indicate what failed?&lt;/li&gt;
&lt;li&gt;Do test descriptions follow a predictable format throughout the test suite?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  No test is sacred
&lt;/h3&gt;

&lt;p&gt;Dogmatic test-driven developers test first. Less committed developers may test later, or not at all. But whether starting with the tests or adding them after the fact, there’s no reason to hold onto tests that have outlived their usefulness. As logic changes, tests may become irrelevant, redundant, or outright misleading. It’s tricky—often dangerous—to throw away legacy code that isn’t well understood. But if a test doesn’t make sense even after a modest investigation, the world will be a better place with one less mystery in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When in doubt, delete
&lt;/h2&gt;

&lt;p&gt;It’s not unique to testing, but a certain sort of ruthlessness goes into keeping up a high-value test suite. Tests create drag, case closed. A thorough test suite takes longer to run, builds are slower, and changes take longer to make. That’s a good thing when something needs to behave in a very specific way, but in less-critical code it simply slows down other, necessary development.&lt;/p&gt;

&lt;p&gt;Focus on value. Test accordingly. If the value isn’t there, or if the opportunity costs exceed the perceived benefits, it’s time for tests to go. They can always come back later, but odds are that you won’t even miss them.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>tdd</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Damage Control in Distributed Systems</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Sat, 25 Jan 2020 15:04:27 +0000</pubDate>
      <link>https://forem.com/rjz/damage-control-in-distributed-systems-2a9b</link>
      <guid>https://forem.com/rjz/damage-control-in-distributed-systems-2a9b</guid>
      <description>&lt;p&gt;It began with a squirrel, judging by the tail, an ill-conceived violation of the transformer's inner sanctum greeted with righteous thunder and an otherwise minor blip across the United Illuminating Company's grid. What &lt;a href="https://www.nytimes.com/1987/12/10/business/stray-squirrel-shuts-down-nasdaq-system.html"&gt;took down the mighty Nasdaq exchange&lt;/a&gt; wasn't the incident itself, but what came after. Not the blaze of glory or the grieving mother in the family nest on the Pequonnock River, but the surge of power flowing back down the lines.&lt;/p&gt;

&lt;p&gt;Grid operators, &lt;a href="https://rjzaworski.com/2018/10/thinking-in-systems"&gt;systems thinkers&lt;/a&gt;, and grey-beard sysadmins will tell you: systems are hard. Defining, assigning, and synchronizing work is tricky enough on one computer, but even &lt;a href="https://rjzaworski.com/2018/10/simple-tools"&gt;simple tools&lt;/a&gt; can yield &lt;a href="https://en.wikipedia.org/wiki/Abelian_sandpile_model"&gt;surprising complexity&lt;/a&gt; when assembled into networks.&lt;/p&gt;

&lt;p&gt;The failure states that arise in deeply interconnected systems are as difficult to anticipate as they are to resolve, and sometimes the best we can do is simply to stop them from spreading. Keep the rest of the system running? That's a good day. No amount of thoughtful application design will resolve a problem begun somewhere else, but by embracing failure and owning its impact on the user, our services can at least strive not to make things worse.&lt;/p&gt;

&lt;p&gt;What follows are several useful patterns for containing and responding to common faults. They're presented in stripped-down form with neither the configurability nor runtime visibility they would need in production, but they're still a useful place to start. Which is to say: study them, don't use them. &lt;em&gt;Caveat emptor&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The common currency
&lt;/h2&gt;

&lt;p&gt;For the sake of example, let's pretend that every action inside a big, imaginary network can be represented as a function--call it &lt;code&gt;Action&lt;/code&gt;--consisting of an immediate &lt;code&gt;Request&lt;/code&gt; and some eventual &lt;code&gt;Response&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's nothing stopping us from representing future actions as continuations, streams, or your favorite asynchronous programming model, but we'll stick to just one of them here.&lt;/p&gt;

&lt;p&gt;Simple enough? Good. Let's go make things fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time limit
&lt;/h2&gt;

&lt;p&gt;Here's one to ponder: how long can a long-running action go on before the customer (even a very patient, very digital customer) loses all interest in the outcome?&lt;/p&gt;

&lt;p&gt;Pull up a chair. With no upper bound, we could be here a while.&lt;/p&gt;

&lt;p&gt;The way to guarantee service within a reasonable window is to &lt;em&gt;set&lt;/em&gt; that bound. "Service" might mean returning an error to our end user--but a timely error will almost always be an improvement over waiting until the heat death of the universe. And the implementation works like you'd expect: we'll start a race between a timer and some &lt;code&gt;pending&lt;/code&gt; promise. If the timer comes back first, we'll declare the promise timed out and unblock it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;TimeoutError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ETIMEOUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;TimeLimiterOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;timeLimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TimeLimiterOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;race&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Timed out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's one big caveat, though: while for all intents and purposes we're carrying on as if we've "canceled" the promise, JavaScript has &lt;a href="https://github.com/tc39/proposal-cancellation"&gt;no native way to cancel&lt;/a&gt; the underlying &lt;code&gt;Request&lt;/code&gt;. We'll see our &lt;code&gt;TimeoutError&lt;/code&gt;, but any resources or computations attached to the &lt;code&gt;pending&lt;/code&gt; promise may still run to completion, and associated references may not yet be released.&lt;/p&gt;
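&lt;p&gt;Where the action cooperates, we can do better than abandoning the promise. Here's a sketch using &lt;code&gt;AbortController&lt;/code&gt;, a standard Web and Node.js API, to give the action a chance to clean up when the deadline passes. The helper names are ours, not part of the pattern above.&lt;/p&gt;

```typescript
// A cancellable delay: aborting the signal releases the pending timer
// instead of leaving it to run to completion.
const abortableDelay = (ms: number, signal: AbortSignal): Promise<void> =>
  new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, ms);
    signal.addEventListener("abort", () => {
      clearTimeout(timer); // actually release the underlying resource
      reject(new Error("Aborted"));
    });
  });

// Like timeLimiter, but hands the action a signal so it can clean up
// after itself when the deadline passes.
async function withDeadline<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeout: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // the action settled; disarm the deadline
  }
}
```

&lt;p&gt;An aborted action can release timers, sockets, and other resources instead of running on unobserved.&lt;/p&gt;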

&lt;p&gt;Cutting off long-running requests is a very sensible step. But what about heading off the sort of congestion that might slow them down in the first place?&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate limit
&lt;/h2&gt;

&lt;p&gt;If we've benchmarked our system's performance--and before worrying too much about behavior under load, it's worth doing--we'll already have a rough sense of where and how it's likely to fail. With those numbers, we can try to keep traffic below some "known-good" threshold and preempt trouble before it starts. Here's a pared-down implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RateLimiterOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RateLimiter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;RateLimitError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERATELIMIT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;rateLimiter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RateLimiterOpts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;RateLimiter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;periodStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;periodStart&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;periodStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Exceeded rate limit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to control invocation, this pattern wraps a &lt;code&gt;Supplier&lt;/code&gt; (in lieu of a single &lt;code&gt;Promise&lt;/code&gt;). By blocking invocations beyond a user-specified &lt;code&gt;limit&lt;/code&gt;, we now have a crude lever to control &lt;code&gt;supplier&lt;/code&gt; throughput.&lt;/p&gt;

&lt;p&gt;This is useful enough, though there's plenty to be done here to improve ease of use. In a more generous implementation we might warn customers as we approach the limit (allowing them to throttle requests accordingly) or even &lt;a href="https://rjzaworski.com/2015/08/circular-queue"&gt;queue rejected requests&lt;/a&gt; to be re-processed later.&lt;/p&gt;
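&lt;p&gt;To make the first idea concrete, here's a sketch that extends the limiter with a hypothetical &lt;code&gt;onWarn&lt;/code&gt; callback. Both the callback and the 80% threshold are assumptions of ours, not part of the implementation above:&lt;/p&gt;

```typescript
type Supplier<T> = () => Promise<T>;

type RateLimiterOpts = {
  limit: number;
  interval: number;
  onWarn?: (remaining: number) => void; // hypothetical extension
};

class RateLimitError extends Error {
  readonly code = "ERATELIMIT";
}

function rateLimiter<T>(opts: RateLimiterOpts) {
  let count = 0;
  let periodStart = 0;

  return (supplier: Supplier<T>) => () => {
    const now = Date.now();
    if (now - periodStart > opts.interval) {
      periodStart = now;
      count = 0;
    }

    if (count >= opts.limit) {
      return Promise.reject(new RateLimitError("Exceeded rate limit"));
    }

    count += 1;
    const remaining = opts.limit - count;
    // Nudge the customer once 80% of the window's budget is spent.
    if (opts.onWarn && remaining <= opts.limit * 0.2) {
      opts.onWarn(remaining);
    }
    return supplier();
  };
}
```

&lt;p&gt;A warned customer can throttle proactively rather than discovering the limit through rejections.&lt;/p&gt;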

&lt;p&gt;Which raises an interesting question: when failure happens, what comes next?&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry
&lt;/h2&gt;

&lt;p&gt;Well, something broke.&lt;/p&gt;

&lt;p&gt;Maybe it was a self-imposed limit from a timer or rate-limiter, or maybe it was a genuine, honest-to-goodness exception in some action. In either case, we'll need to decide whether to dutifully relay the failure to our customer or first attempt recovery on our own. If the error looks recoverable--a network flaked out, or a service was temporarily unavailable--we'll usually start with the latter.&lt;/p&gt;

&lt;p&gt;The easiest step may be simply to retry it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RetryerOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;RetryError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERETRY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;retryer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RetryerOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;onError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;RetryError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Exceeded retry limit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again we've spared some niceties in the name of a straightforward example, but that's the heart of it. We watch for an error, and (provided we haven't already exceeded a reasonable retry limit) we'll go ahead and try again. In a more robust implementation, we would likely want a way to filter "retryable" errors from the unrecoverable sort, as well as some way to review &lt;em&gt;all&lt;/em&gt; of the errors related to both the initial request and our subsequent retry attempts. We may also want the ability to "cancel" retries in-flight, or even to check in on progress. But those, too, are projects for another day.&lt;/p&gt;
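&lt;p&gt;The "retryable" filter, at least, is a small change. Here's a sketch assuming a caller-supplied &lt;code&gt;shouldRetry&lt;/code&gt; predicate (our addition, not part of the original &lt;code&gt;retryer&lt;/code&gt;):&lt;/p&gt;

```typescript
type Supplier<T> = () => Promise<T>;

type RetryerOpts = {
  limit: number;
  shouldRetry: (err: unknown) => boolean; // assumed extension
};

function retryer(opts: RetryerOpts) {
  return <T>(supplier: Supplier<T>): Promise<T> => {
    let attempt = 1;

    const onError = async (err: unknown): Promise<T> => {
      // Relay unrecoverable errors (bad input, failed auth, ...) immediately.
      if (!opts.shouldRetry(err) || attempt >= opts.limit) {
        throw err;
      }
      attempt += 1;
      return supplier().catch(onError);
    };

    return supplier().catch(onError);
  };
}
```

&lt;p&gt;In practice the predicate might match on error codes like &lt;code&gt;ETIMEOUT&lt;/code&gt;, leaving everything else to fail fast.&lt;/p&gt;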

&lt;p&gt;One that's worth taking on, however, is the schedule on which our retries are sent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backoff
&lt;/h2&gt;

&lt;p&gt;A retry policy is both a sensible first step and a terrific way to load artificial traffic onto an already-beleaguered service. If a failure may be due to request throttling or an overwhelmed service, it's a good idea (as well as good form) to take a breath before any subsequent retries.&lt;/p&gt;

&lt;p&gt;Here's our original &lt;code&gt;retryer&lt;/code&gt;, this time with room for different backoff strategies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;BackoffStrategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;RetryerOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BackoffStrategy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;retryer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RetryerOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;onError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;RetryError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Exceeded retry limit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;supplier&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;fixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Backoff isn't quite enough on its own, though. Consider the case where a surge in requests causes many clients to fail near-simultaneously. If the clients share a common retry policy, no amount of backing off will prevent the coordinated surge that follows with each successive retry.&lt;/p&gt;

&lt;p&gt;What &lt;em&gt;will&lt;/em&gt; save them is to &lt;a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/"&gt;jitter&lt;/a&gt; the backoff calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We may still face down the original collision, but a bit of entropy sprinkled on our beautiful, deterministic backoff algorithm will at least prevent the (unintentionally) coordinated sort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Circuit breaker
&lt;/h2&gt;

&lt;p&gt;So far we've put bounds around the sorts of faults that are relatively easy to specify. Requests need to settle within a certain window? Cut them off. Services will fail under a certain load? Limit the volume of requests.&lt;/p&gt;

&lt;p&gt;Not all production faults are so accommodating. We may not know &lt;em&gt;how&lt;/em&gt; something will fail--but when it does, we do want to detect it and avoid making it worse.&lt;/p&gt;

&lt;p&gt;Enter the circuit breaker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CircuitBreakerOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;waitDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;CircuitOpenError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ECIRCUITOPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;State&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;circuitBreaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CircuitBreakerOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;State&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;lastChangeAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;setState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;State&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;lastChangeAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;lastChangeAt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;waitDuration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;CircuitOpenError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Circuit breaker is open&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;failureRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rejected&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;failureRate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real world circuit breakers save all sorts of incendiary unpleasantness by cutting off the power to a circuit that's exceeded its design load. Our version will open its "circuit" when a certain signal (in this case, the error rate) exceeds a user-defined &lt;code&gt;threshold&lt;/code&gt;. After waiting in the &lt;code&gt;OPEN&lt;/code&gt; state, the breaker will relax to a &lt;code&gt;HALF_OPEN&lt;/code&gt; position, at which point the success or failure of the next request will either restore normal operations or trip it back &lt;code&gt;OPEN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once again, a production version of the circuit breaker would also need some ability to filter out "safe" errors, as well as expose some insight into the present state of the &lt;a href="https://en.wikipedia.org/wiki/Finite-state_machine"&gt;state machine&lt;/a&gt; embedded inside our breaker. We'd likely surface an &lt;code&gt;EventEmitter&lt;/code&gt; or push state changes, failure rates, or both to a metrics collector--just as we'd ship attempts, throughput, and latency from the rest of our fault-tolerance toolkit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it into practice
&lt;/h2&gt;

&lt;p&gt;There are a couple of points worth considering before putting these patterns into practice. First, will the complexity added by any of these patterns solve a pressing problem within your system? If the answer is anything less than certain, leave it out.&lt;/p&gt;

&lt;p&gt;Second, consider whether a proven library like &lt;a href="https://github.com/Netflix/Hystrix"&gt;Hystrix&lt;/a&gt; or &lt;a href="https://github.com/resilience4j/resilience4j"&gt;resilience4j&lt;/a&gt; (or a port into your favorite language) will provide the features you need. It probably will. Leaning on it will save the trouble of verifying, benchmarking, and ironing out the kinks in your own, homegrown safety equipment.&lt;/p&gt;

&lt;p&gt;If you're tired of &lt;a href="https://en.wikipedia.org/wiki/Finite-state_machine"&gt;chasing mysteries&lt;/a&gt; through a big, teetering network, though, patterns like these can help make its failures more obvious. Maybe you'll combine them--by wrapping a circuit breaker around a rate-limited component, say--to layer on the sanity-checks and fallback behaviors. Maybe they'll live in sidecar containers, in API middleware, or in proxies fronting for serverless functions.&lt;/p&gt;

&lt;p&gt;But whatever and however you use them, the ability to recognize and handle errors is a critical part of coming to terms with the complexity of your distributed system. Assume failure. Own the bad news. And above all, don't make it worse.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
