<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tom Warner</title>
    <description>The latest articles on Forem by Tom Warner (@tomwarner13).</description>
    <link>https://forem.com/tomwarner13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F242004%2F1f7337af-df55-44c4-8781-cc6c8e5adde4.jpg</url>
      <title>Forem: Tom Warner</title>
      <link>https://forem.com/tomwarner13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tomwarner13"/>
    <language>en</language>
    <item>
      <title>A Taxonomy of Software Failures</title>
      <dc:creator>Tom Warner</dc:creator>
      <pubDate>Tue, 27 Jun 2023 21:12:09 +0000</pubDate>
      <link>https://forem.com/dealeron/a-taxonomy-of-software-failures-2cc0</link>
      <guid>https://forem.com/dealeron/a-taxonomy-of-software-failures-2cc0</guid>
      <description>&lt;p&gt;(Somewhat inspired by &lt;a href="https://ericlippert.com/2008/09/10/vexing-exceptions/"&gt;this classic article&lt;/a&gt; about categorizing exceptions, which anyone who writes software should read.)&lt;/p&gt;

&lt;p&gt;There are an infinite number of specific ways that a production software system can fail, and in my career I've encountered a vast array of them (of course, some percentage of them were my fault in the first place). But I've noticed that production failures tend to fall into a few broad categories, so my hope is that by sorting the categories out, it becomes easier to anticipate and fix issues (and maybe even prevent a few -- that would be nice!).&lt;/p&gt;

&lt;h2&gt;
  
  
  New Defect Released
&lt;/h2&gt;

&lt;p&gt;This is by far the most straightforward failure. Someone wrote some new code, it got deployed, and it turned out to have an immediately obvious bug in it. This sort of issue can be mitigated by a process of thorough code review, QA testing, integration testing, etc., and it can also often (but not always!) be solved by having a robust and simple process for rolling back any newly released code. However, if the released code is, for example, a dependency for something else in prod, or if it bundles other features which can't be reverted out, your best fallback is an accelerated process for deploying hotfixes. Note that every obstacle between writing and deploying code (review, permissions gate, deployment window, ...) which helps reduce the frequency of errors making it out to prod also slows the deployment of any fix for the errors that do make it to prod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Existing Defect Noticed
&lt;/h2&gt;

&lt;p&gt;When buggy code gets released, it's actually best for someone to notice immediately. If that doesn't happen, you may end up in this situation, where a bug has been out in production for plenty of time but it has only just become an emergency. There are many reasons for something like this to happen -- for example, a new customer may start using an existing feature in a way you didn't anticipate, or a new release may depend on the defective code in a high-impact way.&lt;/p&gt;

&lt;p&gt;The best way to avoid this sort of thing is to avoid deploying buggy code in the first place, of course. However, since we live in an imperfect world and you will deploy buggy code from time to time, there are a few mitigations available. One is to make it a priority to find and fix existing bugs -- decrease the friction for your users to report them, motivate developers to fix them, and generally invest in code quality. However, all that will fail from time to time, so the last fallback is again to shorten the turnaround time between "defect noticed" and "hotfix deployed".&lt;/p&gt;

&lt;h2&gt;
  
  
  External Failure
&lt;/h2&gt;

&lt;p&gt;A large-scale software ecosystem collects external dependencies the way the floor underneath your couch collects dust bunnies and long-lost board game pieces. However, unlike the problem under your couch, which can be solved by a vacuum cleaner, you cannot go through and clean out the external dependencies, because they provide solutions for your users. This is unfortunate, because sometimes these dependencies will fail, due to circumstances entirely beyond your control. Sometimes third-party software breaks entirely on its own, and sometimes vendors just SILENTLY CHANGE THEIR API FORMAT ONE DAY BECAUSE THEY ARE MONSTERS. The best you can do is proactively identify and catalog all the external dependencies you have, make sure the systems which connect to them are as resilient as possible in case they fail, and (if possible) identify and implement ways to back up or cache external data so that you have temporary fallbacks available if you need them.&lt;/p&gt;
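&lt;p&gt;As a concrete example of "as resilient as possible", here's a minimal sketch of a retry-with-backoff wrapper. &lt;code&gt;with_retries&lt;/code&gt; is a hypothetical helper, and the curl call in the trailing comment is a placeholder for whatever your real external dependency call is.&lt;/p&gt;

```shell
# Hypothetical helper: run a command up to N times, doubling the wait
# between attempts, before giving up and falling back.
with_retries() {
    attempts=$1
    shift
    delay=1
    for i in $(seq 1 "$attempts"); do
        if "$@"; then
            return 0
        fi
        echo "attempt $i failed, retrying in ${delay}s..."
        sleep "$delay"
        delay=$((delay * 2))
    done
    echo "all $attempts attempts failed; time to fall back to cached data"
    return 1
}

# e.g.: with_retries 3 curl -fsS https://api.example.com/inventory
```

&lt;p&gt;In most real systems this logic lives in application code (with circuit breakers and all the rest) rather than shell, but the shape is the same.&lt;/p&gt;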

&lt;h2&gt;
  
  
  Load/Infrastructure Failure
&lt;/h2&gt;

&lt;p&gt;Though many would prefer not to think about this, your software ecosystem ultimately relies on physical machines, existing in the real world, running your software. Sometimes these machines have physical problems. More commonly, especially if you use the sort of cloud provider that lets you mostly abstract that away, the work you're asking some piece of infrastructure to do exceeds the work it is capable of doing, and you start running into all manner of complex and diabolical failure states. Your server runs out of memory, your database runs out of rows, your monthly subscription runs out of compute, or perhaps you have some kind of exponentially inefficient algorithm buried deep in your codebase that is suddenly running against too much data to handle in a timely manner.&lt;/p&gt;

&lt;p&gt;This sort of thing is why devops people get paid so well. But obviously there are ways to be proactive, such as monitoring the resources you're using and the performance of your infrastructure. Having a well-architected environment in general is a huge help here because it makes it easier for you to fix or replace infrastructure on the fly.&lt;/p&gt;
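&lt;p&gt;To make "monitoring the resources you're using" slightly less abstract, here's a toy version of the kind of check a cron job or monitoring agent runs constantly. The 90 percent threshold and the plain &lt;code&gt;echo&lt;/code&gt; are placeholders for a real threshold and a real alerting hook.&lt;/p&gt;

```shell
# Toy resource check: warn when the root filesystem is nearly full.
check_root_disk() {
    usage=$(df -P / | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
    if [ "$usage" -ge 90 ]; then
        echo "ALERT: root filesystem at $usage percent"
        return 1
    fi
    echo "OK: root filesystem at $usage percent"
}

check_root_disk || echo "(real alert hook would fire here)"
```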

&lt;p&gt;I find it helps to be aware of these different failure cases because they cover just about anything that can go wrong with production software, so when something starts to break, an easy starting point is to try and figure out which of these (it may be more than one!) is happening. This sort of knowledge can also inform your attempts to be proactive about heading off problems before they emerge, or to be reactive and categorize the failures you've been experiencing in hopes of future improvement.&lt;/p&gt;

&lt;p&gt;(Reposted from my personal blog &lt;a href="https://listed.to/@Biological_Speculation/43444/a-taxonomy-of-software-failures"&gt;here&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Git Learnt</title>
      <dc:creator>Tom Warner</dc:creator>
      <pubDate>Thu, 11 May 2023 13:33:56 +0000</pubDate>
      <link>https://forem.com/dealeron/git-learnt-1plp</link>
      <guid>https://forem.com/dealeron/git-learnt-1plp</guid>
      <description>&lt;p&gt;(ya know, if you want)&lt;/p&gt;

&lt;p&gt;Git is a powerful tool, and by far the most effective way to unlock its potential is to use it on the command line. There's just one issue: the syntax for Git commands is unintuitive, often overly verbose, and written at a very low level, so many operations you care about (like, say, "put me on the latest version of this specific branch from the remote") tend to involve several commands. And since the commands are often context-dependent, common operations may require different commands depending on the state of your local repo. &lt;/p&gt;

&lt;p&gt;So it's understandable that there are so many Git UIs. However, what I've found over years of working with Git on the command line is that by setting up aliases and local functions for common workflows, I can accomplish just about everything I need with short, simple commands. Here are some of those aliases, along with explanations of how they work.&lt;/p&gt;

&lt;p&gt;All of these commands live in a &lt;a href="https://github.com/tomwarner13/Git-Better"&gt;GitHub repo&lt;/a&gt; where I've been collecting them, so you should be able to either clone it or copy-paste them out to get them working locally. Also, all the shell aliases and functions assume you're using a Unix-like shell (I use Git Bash even on Windows), but I'm sure if you really wanted to, you could port them over to something else like PowerShell.&lt;/p&gt;
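&lt;p&gt;For the "get them to work locally" step, the usual trick is to put the functions in a file and source that file from your shell startup script. A minimal sketch, with made-up paths and file names (the actual layout of the repo may differ):&lt;/p&gt;

```shell
# Write a couple of the shortcuts into a file (paths are placeholders).
mkdir -p "$HOME/.git-better"
printf "gs() { git status; }\ngd() { git diff; }\n" | tee "$HOME/.git-better/aliases.sh"

# Load it into the current shell...
source "$HOME/.git-better/aliases.sh"

# ...and add that same source line to ~/.bashrc so every new shell gets it.
type gs
```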

&lt;h2&gt;
  
  
  Put me on the latest version of this branch
&lt;/h2&gt;

&lt;p&gt;This is one that somehow takes up to 3 separate commands in native Git, which is ridiculous because it's a core part of anyone's workflow. The alias I use is &lt;code&gt;gfl&lt;/code&gt; (git-fetch-latest), and it's configured so that with a branch name as an argument it puts you on the latest revision of that branch, while with no argument it pulls the latest version of your current branch. Super useful for code reviews!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gfl() {
    if [ $# -eq 0 ]; then
        git pull
    else
        git fetch &amp;amp;&amp;amp; git checkout "$1" &amp;amp;&amp;amp; git pull
    fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gfl branch-name #pulls the latest version of branch-name from origin&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ gfl #with no arguments, pulls the latest version of the branch you're already on&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Start a new branch from the commit I'm on
&lt;/h2&gt;

&lt;p&gt;This is a common operation but of course the branch commands are janky and unintuitive (IN WHAT UNIVERSE DOES IT MAKE SENSE FOR &lt;code&gt;git checkout -b&lt;/code&gt; TO CREATE A BRANCH AND &lt;code&gt;git branch -D&lt;/code&gt; TO DELETE ONE). So you can either memorize the long form of this syntax, or make a short little alias (&lt;code&gt;gcb&lt;/code&gt; for git-create-branch) and not have to think about it. Note that any existing local changes will not be affected by this command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcb() {
        git checkout -b "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gcb new-branch-name&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Start a new branch from a specific tag
&lt;/h2&gt;

&lt;p&gt;The workflow at my job involves using tags to mark specific builds, and mostly when we create new feature branches, it's those tags that we branch off. This is another one that's a fairly straightforward command in Git but it has a weird and verbose syntax, so I alias it to &lt;code&gt;gbt&lt;/code&gt; for git-branch-tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gbt() {
        git fetch &amp;amp;&amp;amp; git checkout "tags/$1" -b "$2"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gbt target-tag new-branch-name&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Merge a specific tag into your branch
&lt;/h2&gt;

&lt;p&gt;Sometimes at my job, the build we're targeting our feature branches against changes. When that happens, we should merge that newer commit (represented by a tag) into the feature branch we're working on. I use &lt;code&gt;gmt&lt;/code&gt; for git-merge-tag to do this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gmt() {
        git fetch &amp;amp;&amp;amp; git merge "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gmt target-tag&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Blow away all local work
&lt;/h2&gt;

&lt;p&gt;Sometimes you're just having one of those days where you modify 18 files and then you realize it was all completely wrong. Right? It's not just me who has those days? Well, anyway, if you need to just wipe out every local change, you can use &lt;code&gt;grhh&lt;/code&gt; (git-reset-hard-head). This is another one where I memorized the syntax years ago and then realized I didn't like typing the whole thing out every time I need it, but as usual the syntax is janky and awkward and there's no reason you should have to remember such a specific series of flags for such a basic operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias grhh='git reset --hard HEAD'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ grhh #just make extra sure you don't have any local work you want to keep!&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make your submodules work
&lt;/h2&gt;

&lt;p&gt;If you're dealing with submodules, it can be kind of a pain to keep them in a correct state, especially since operations that change your local branch don't touch the submodules. This is (again!) a verbose one-liner, but running it will get your submodules into exactly the state that your current branch expects them to be in--even if they're not checked out at all, or if they're nested. If you're dealing with submodules that change a lot, you may want to work this alias into the commands you use to switch branches. It's &lt;code&gt;gsm&lt;/code&gt; for git-sub-module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias gsm='git submodule update --init --recursive'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gsm&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit and push all your local work
&lt;/h2&gt;

&lt;p&gt;A lot of the time when I'm working on a feature branch, I get to a good stopping point and I just want to push all my local work remotely. To do this in Git natively takes 3 commands, of course, but as you may have gathered, I hate typing any more letters than I absolutely must. So I set up &lt;code&gt;gacp&lt;/code&gt; for git-add-commit-push to let me just enter a commit message and send it all upstream in one operation. The one caveat with this is that sometimes you do have changes on your local that you don't actually want to go up to the remote; for that situation I use the next command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gacp() {
        git add . &amp;amp;&amp;amp; git commit -m "$1" &amp;amp;&amp;amp; git push
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gacp "this is my sweet commit message"&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit and push all your staged local work
&lt;/h2&gt;

&lt;p&gt;If you have a bunch of local changes that you want to push, and some you don't, you can use this command to commit and push just the changes you've already staged -- use the &lt;code&gt;git add&lt;/code&gt; command to stage, and you can check the output of &lt;code&gt;git status&lt;/code&gt; (aliased below!) to see what is and isn't staged already. I call it &lt;code&gt;gcp&lt;/code&gt; for git-commit-push.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcp() {
        git commit -m "$1" &amp;amp;&amp;amp; git push
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gcp "this is my sweet commit message"&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull from the remote, but without any weird error messages
&lt;/h2&gt;

&lt;p&gt;If you have created a local branch and pushed it remotely, and then someone else has put newer commits on it, and you try to pull those commits down, Git is liable to &lt;a href="https://stackoverflow.com/q/6089294/5503076"&gt;spit out an annoying error message at you and fail&lt;/a&gt;. This is because technically the local&amp;lt;-&amp;gt;remote linkage in Git works in two different directions, and by pushing your branch up from local, you do not default to setting it as the upstream target for when you do a pull. I'm sure there are complicated technical reasons that it works this way, but also I don't care because it's a dumb way for Git to work.&lt;/p&gt;

&lt;p&gt;Anyway, I set up this alias (&lt;code&gt;gpd&lt;/code&gt; for git-pull-down) that lets you just pull from the remote branch matching the name of your local branch without worrying about any of this, and also save some typing of course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpd() {
    head="$(git rev-parse --abbrev-ref HEAD)"

    if [[ $(git config "branch.$head.merge") ]]; then #there's already a merge target configured, just pull as normal from there
        git pull
    else
        if [[ $(git ls-remote --heads origin "$head") ]]; then #an upstream branch exists with the same name as our branch
            git branch --set-upstream-to "origin/$head" #set merge target to the upstream branch with the same name
            git pull
        else #fail with explanation
            echo "Branch $head has no upstream or merge target! You will likely have to push first, or manually configure it"
            return 1
        fi
    fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gpd&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Print a diff of all locally changed files
&lt;/h2&gt;

&lt;p&gt;This is actually one that's really easy to write and remember but I hate typing and I run it all the time, so I've aliased it down to &lt;code&gt;gd&lt;/code&gt; for git-diff. Also I use &lt;a href="https://github.com/so-fancy/diff-so-fancy"&gt;diff-so-fancy&lt;/a&gt; to make the output of my diffs look frickin sweet and I suggest you do the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias gd='git diff'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gd&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Print the current status of your repo
&lt;/h2&gt;

&lt;p&gt;Once again this is an easy one that I'm just too lazy to type out each time. I have it aliased to &lt;code&gt;gs&lt;/code&gt; for git-status.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias gs='git status'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ gs&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Hopefully these commands save you some typing, and also spare you from memorizing some arcane and obnoxious syntax. My brain and my fingers have very little tolerance for either, so having a bunch of two- or three-character commands for the most common operations I run in Git directly improves my workflow, my happiness, and my sanity.&lt;/p&gt;

&lt;p&gt;(Reposted from my personal blog &lt;a href="https://biologicalspeculation.blog/43658/git-learnt"&gt;here&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>git</category>
    </item>
    <item>
      <title>Azure SQL Database: What We've Learned About Automatic Tuning and Performance Recommendations</title>
      <dc:creator>Tom Warner</dc:creator>
      <pubDate>Fri, 08 Nov 2019 15:51:48 +0000</pubDate>
      <link>https://forem.com/dealeron/azure-sql-database-what-we-ve-learned-about-automatic-tuning-and-performance-recommendations-3o2a</link>
      <guid>https://forem.com/dealeron/azure-sql-database-what-we-ve-learned-about-automatic-tuning-and-performance-recommendations-3o2a</guid>
      <description>&lt;p&gt;Here on DealerOn's Team Mercury, we've had an exciting opportunity to build out entirely new systems in the last couple years, starting from scratch. This means that we've been very lucky to be a team that gets to work with exciting new technologies while a bunch of other chumps have to deal with the messes we left behind in the legacy systems (ha, suckers).&lt;/p&gt;

&lt;p&gt;One of the exciting new technologies we get to work with is Azure SQL Database. Our reporting application is in sort of an unusual situation where we have to store a huge amount of data, and we want to be able to read it quickly in certain situations, but we can do almost all our writes off hours.&lt;/p&gt;

&lt;p&gt;This is because we only care about most of this data on a daily timeframe. So while we have to import a lot of data, we run those imports between about midnight and 6 AM Eastern, while most users are sleeping. During the day, we turn around and read out this data to users as requested through our web API--this is when we start to really care about performance. It's far more of a big deal for a user to have to wait 10 seconds for an API call to return than it would be for an import to take 10 extra minutes to run at 3 AM.&lt;/p&gt;

&lt;p&gt;What this means is that we've had to learn a lot about database optimization, and then how to apply what we've learned to our extremely specific use case. In this article, I'll focus on automatic tuning, and the performance recommendations that Azure regularly suggests for us.&lt;/p&gt;

&lt;p&gt;Azure has three options for automatic tuning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FORCE PLAN (we have this turned on)&lt;/li&gt;
&lt;li&gt;CREATE INDEX (we have this turned off)&lt;/li&gt;
&lt;li&gt;DROP INDEX (we have this turned off)&lt;/li&gt;
&lt;/ul&gt;
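
&lt;p&gt;If you'd rather script these settings than click through the portal, the same toggles can be set in T-SQL (&lt;code&gt;FORCE_LAST_GOOD_PLAN&lt;/code&gt; is the T-SQL name for "Force Plan"). A sketch, with the statement held in a shell variable so you can feed it to &lt;code&gt;sqlcmd&lt;/code&gt; or paste it into SSMS -- check the current Azure docs for exact syntax before relying on it:&lt;/p&gt;

```shell
# Our settings from the screenshot below, expressed as T-SQL.
TUNING_SQL="ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON, CREATE_INDEX = OFF, DROP_INDEX = OFF);"
echo "$TUNING_SQL"
# e.g.: sqlcmd -S your-server.database.windows.net -d your-db -Q "$TUNING_SQL"
```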

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0taxg95q4cz3qh4iac4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0taxg95q4cz3qh4iac4w.png" alt="Our settings in the Azure UI"&gt;&lt;/a&gt;&lt;em&gt;Our automatic tuning settings&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'll go into depth on these options, and why we chose to set them the way we did. Azure also offers "Performance Recommendations" that are basically suggestions to create or drop indexes if you won't let it do that automatically, and to parameterize your queries (which you should be doing all the time anyway). So if you don't set the tuning options to create/drop indexes automatically, you'll get suggestions from Azure to do them manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  FORCE PLAN
&lt;/h2&gt;

&lt;p&gt;This is basically a magic button that makes your database go faster. Without going too far into depth, sometimes SQL Server tries to run a query using a cached execution plan that turns out to be sub-optimal, usually because the parameters have changed. When you have "Force Plan" turned on, it will automatically detect when you're using a bad plan, and switch to a faster one. We turned this setting on as part of an investigation into why some of our queries were sometimes much slower than they ought to be, and discovered that it made those queries and many others faster, often by orders of magnitude.&lt;/p&gt;

&lt;p&gt;I still feel like this is kind of a hack--if SQL Server is smart enough to detect when it's using a bad plan, why couldn't it just use a good plan to start? But in our experience, turning this setting on made a lot of things faster and didn't noticeably slow anything down. If we wanted to see these kinds of gains manually, we'd have to roll up our sleeves and dig really deep into query plans and optimization, for just about every query that we run. While it's possible that doing that right could provide us with even more benefit, it would be a huge amount of work, and we'd likely have to keep repeating it for every new query we write, as well as checking back on old queries to make sure nothing's changed. So this option makes a lot of sense--it's one button to click that provides a major performance boost for almost no effort or drawbacks. I strongly recommend turning it on, and then bragging to your managers about your deep understanding of database optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftaracn7nh1qx5bubh6q0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftaracn7nh1qx5bubh6q0.jpg" alt="Gandalf the wizard, using his arcane knowledge to speed up SQL queries"&gt;&lt;/a&gt;&lt;em&gt;How your managers will see you after you click this one weird button&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CREATE INDEX
&lt;/h2&gt;

&lt;p&gt;We decided to investigate the index suggestions that Azure comes up with, to evaluate whether to let it run them automatically. Overall, we found that it does often come up with good suggestions, but sometimes the suggestions it comes up with are wrong for our use case, and sometimes they're just wrong.&lt;/p&gt;

&lt;p&gt;The right index can make a huge difference in query performance. However, every index brings performance drawbacks, so the traditional process of database tuning involves carefully considering the advantages of any new index against the drawbacks. Azure's automatic tuning claims to do the same, to the best of its ability; for example, it continuously monitors the automatic indexes it creates, and will remove them if it detects performance problems, though the criteria are not clear. However, one major drawback of the automatic tuning system is that it lacks any context of what's important to you as the user of the database. I've also found it sometimes recommends indexes that don't actually help, though I'd hope it will detect those and revert them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmjwyu9do65frqqjcxuuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmjwyu9do65frqqjcxuuo.png" alt="Screenshot of our current recommended actions in the Azure UI"&gt;&lt;/a&gt;&lt;em&gt;Our recommended indexes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To Azure, a query is a query. There's no way for our database to know whether a specific SELECT statement came from an API endpoint (which needs a response ASAP) or a daily scheduled marketing data export (which doesn't need to be nearly as fast). Therefore, some of the automatic indexes may optimize things that you didn't really need optimized in the first place, potentially at the cost of slowing down the things you do want to be fast. A cool feature for future versions (take note, Microsoft) might be the ability to mark queries by priority level, which could tell Azure what to focus optimizations on, and maybe even let higher-priority queries take precedence in conflict situations.&lt;/p&gt;

&lt;p&gt;Automatic index recommendations also seem to miss some broader context sometimes. One of the index suggestions I evaluated involved putting a non-clustered index on two columns, to help speed up a query that uses those columns in its WHERE clause. However, on investigation, I found that this exact query goes on to update one of the columns in this index. When I applied the index on a testing database, I found that it actually increased execution time on the query by up to 50%--the time taken to rebuild the index was more than the time saved by having that index in the first place! Obviously we discarded the suggestion, so we don't know what would have happened if automatic tuning had been enabled. Hopefully Azure would have detected the regression and quickly reverted the index, but this is a great example of why you should evaluate these suggestions yourself before accepting them.&lt;/p&gt;

&lt;p&gt;Unfortunately, the process for evaluating tuning suggestions is kind of janky. When they show up in your recommendations, they only come with the table and columns suggested, not the query (or queries) that actually triggered the recommendation. This has been &lt;a href="https://feedback.azure.com/forums/217321-sql-database/suggestions/36856939-include-improved-regressed-query-plan-links-in-s" rel="noopener noreferrer"&gt;suggested&lt;/a&gt;, and ignored--a classic Microsoft move to get you to use their products without thinking too much about what's happening at a lower level. There may be a way to track through the system views to find the actual query text you want to see, but I'm not enough of a wizard to figure it out. So we've been doing it the old-fashioned way instead, by looking at Regressed Queries in SSMS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fusihkh4vubd3vy8jsdg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fusihkh4vubd3vy8jsdg9.png" alt="A screenshot of the Regressed Queries UI, showing a suggested index"&gt;&lt;/a&gt;&lt;em&gt;Details of an index we decided not to apply&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Basically, any query that triggered a high-priority index recommendation is likely to be one of the top regressed queries. Queries in this view also include missing indexes in their details, so you can match those up with the indexes getting recommended in the Azure UI to trace which queries they should be helping. Once you know the actual problem that Azure has detected, you can take a step back and figure out how to solve it. Keep in mind that you may not need a new index to solve every performance problem, and with your broader human knowledge of the system, you might be able to find a fix that doesn't involve changing anything on the DB side. For example, how high priority is this query? Is there a way it could be rewritten to take advantage of existing indexes or other optimizations? Do you actually even need it at all? Sometimes we go investigate a query in our code, only to find out that we're not actually using its results anywhere. &lt;/p&gt;

&lt;p&gt;If it still seems like an index might be helpful, you can test it. We use dev databases that are regularly refreshed by cloning the production databases, which generally make it very easy to test optimizations on. I normally start by running the offending query a few times with a few different parameters (if applicable) and recording those execution times, to get a baseline for performance. Then you can add your index and rerun the same queries as before, to get an accurate idea of the performance difference. If there's a noticeable improvement, you probably want to go ahead and create this index for real.&lt;/p&gt;
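
&lt;p&gt;That baseline-then-compare routine can be sketched as a tiny timing loop. &lt;code&gt;time_runs&lt;/code&gt; is a hypothetical helper; in practice, the command we'd pass it is a &lt;code&gt;sqlcmd&lt;/code&gt; invocation of the offending query against the dev database, run once before and once after adding the index.&lt;/p&gt;

```shell
# Hypothetical helper: run the same command three times and print
# wall-clock timings, so before/after numbers are easy to compare.
time_runs() {
    for i in 1 2 3; do
        start=$(date +%s%N)
        "$@"
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms"
    done
}

# e.g.: time_runs sqlcmd -S my-server.database.windows.net -d my-db -i slow_query.sql
```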

&lt;p&gt;At DealerOn, we still create these indexes manually, instead of just accepting the suggestions through the Azure UI. This is so that we can integrate them into our normal workflow process, and so that we can give them descriptive names that fit into our own schema. We then monitor performance after they've been applied, to make sure they're actually helping out where they're supposed to, and to make sure they're not causing issues anywhere else.&lt;/p&gt;

&lt;h2&gt;
  
  
  DROP INDEX
&lt;/h2&gt;

&lt;p&gt;The flip side of the CREATE INDEX automatic suggestions is that Azure will also monitor and detect when it thinks indexes are hurting your performance. We don't have much experience with this, because it's never given us this suggestion before--clearly all our indexes are doing great! However, if we ever see this suggestion show up, we'll follow a validation process like we do for the CREATE suggestions.&lt;/p&gt;

&lt;p&gt;Since all our indexes are created manually with descriptive names, we should be able to trace them back to the associated JIRA ticket and find out when and why they were created. We can then determine whether this reasoning is still valid, or even whether it was valid in the first place. This may help us identify issues where we didn't properly account for a change we made, or with our original process that created the index, or even larger structural issues with our system design. &lt;/p&gt;

&lt;p&gt;If you don't want to worry about that sort of stuff, that's fine -- maybe you should turn on the CREATE and DROP INDEX options. In a normal scenario, they'll probably improve your overall database performance for a minimum of effort. But if you really care about optimization (like we do), or if some optimizations are more optimal than others (like in our use case), or even if you're just interested in knowing what SQL Server is doing at a lower level, I recommend you disable the automatic index tuning and instead start paying attention to the suggestions you get.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>azure</category>
    </item>
  </channel>
</rss>
