<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gregory Paciga</title>
    <description>The latest articles on Forem by Gregory Paciga (@gpaciga).</description>
    <link>https://forem.com/gpaciga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F303803%2Fb1db55c6-212e-4cb8-8bfe-bb57ac49b826.jpg</url>
      <title>Forem: Gregory Paciga</title>
      <link>https://forem.com/gpaciga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gpaciga"/>
    <language>en</language>
    <item>
      <title>The myth of “unstable” code</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 13 Feb 2024 15:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/the-myth-of-unstable-code-31of</link>
      <guid>https://forem.com/gpaciga/the-myth-of-unstable-code-31of</guid>
      <description>&lt;p&gt;The most common reason I get for people delaying test automation is that the code is “unstable”, and automating too soon will result in a lot of re-work of automation. Better to only automate after a feature is “stable”, i.e. development work is all done, they say, so you won’t have to rework any of the automation. This belief is a myth that makes automation harder in the long run, not easier.&lt;/p&gt;

&lt;h2&gt;Your code is not unstable&lt;/h2&gt;

&lt;p&gt;This might feel like quibbling with semantics, but consider these two statements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our code is unstable&lt;/li&gt;
&lt;li&gt;Developers are changing the code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I probe what people mean by “unstable”, they often consider these two statements to be the same. To me, they could not be more different. “Unstable” to me means unpredictable, that it exhibits different behaviour at different times, that there are unknown variables at play, that it is unknowable.&lt;/p&gt;

&lt;p&gt;This description should, hopefully, bear little resemblance to how your developers change code. There can be a period when developers have to experiment with how to implement something and things truly change quite a lot, but in 90% of development work developers are working against specific requirements, acceptance criteria, and plans. You, as a QA, should be a part of those planning discussions. If you are so disconnected from the changes the developers are making that the code feels “unstable”, then you have a teamwork problem, not a software problem.&lt;/p&gt;

&lt;p&gt;If your code truly is “unstable” in the sense that it doesn’t behave deterministically, then that’s a product issue. Either it’s deliberate and you need to figure out how to test for that, or it’s not deliberate and it’s a problem that needs to be fixed.&lt;/p&gt;

&lt;h2&gt;Your code will never be “stable”&lt;/h2&gt;

&lt;p&gt;Let’s say we really do mean that we want to wait for the developers to be finished before doing automation. What does it mean for developers to be finished?&lt;/p&gt;

&lt;p&gt;In a waterfall or project development model, where there is a limited scope and a well defined beginning and end to the project, you might be able to do this. There are other reasons why you shouldn’t want to (which I’ll touch on below), but it could work if you can get a real commitment that development stops and you’ll have a long enough “test implementation” phase where you can automate.&lt;/p&gt;

&lt;p&gt;In product development, though, where you have a continuous backlog of work that the team moves through with incremental releases, this is impossible. If you wait until devs are done with one set of features before writing test automation, that set of features might be stable, but the code as a whole will still be unstable because the devs are still changing it for the next set of features. Again, you might be able to work this way if you have good controls to keep separate environments for each version of the application.&lt;/p&gt;

&lt;p&gt;With ongoing product work, the product is never finished, so it will never be “stable”. This no longer works as a reason to delay automation.&lt;/p&gt;

&lt;h2&gt;Waiting doesn’t prevent re-work&lt;/h2&gt;

&lt;p&gt;The third part of this myth claims that by waiting to automate, you will prevent re-work.&lt;/p&gt;

&lt;p&gt;Again, this depends on the same fallacy: the only way automation will never require re-work is if nothing about the application ever changes. For well defined standalone projects, maybe this can be true. But I’ve yet to experience that.&lt;/p&gt;

&lt;p&gt;The re-work required for automation is actually &lt;em&gt;worse&lt;/em&gt; when you delay it based on an unattainable standard of “stability”. As I’ve written before, &lt;a href="https://gerg.dev/2024/01/three-ways-that-manual-testing-is-a-waste/"&gt;waiting to automate triples the amount of extra effort required&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Debunking the myth&lt;/h2&gt;

&lt;p&gt;To summarize, given the mistaken belief that “we have to wait to automate because the code is unstable and we want to avoid re-work”, consider instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your code isn’t unpredictable or unknowable: it is being changed deliberately by other developers on your team. Work with them so that you know what changes are coming and can plan for them.&lt;/li&gt;
&lt;li&gt;Your code is likely never going to be “stable” because software development is usually an ongoing exercise in maintaining and extending products. In most cases, if you wait for your code to stop changing, you will be waiting forever.&lt;/li&gt;
&lt;li&gt;Re-working automation is unavoidable and only gets more expensive the longer you delay keeping your automation up to date. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In limited circumstances, such as an early POC that will be thrown away before going to production or a waterfall-style standalone project, waiting to automate may make sense. In most cases, however, you will be able to have a much bigger impact on the quality of your product by automating early.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>culture</category>
      <category>generaltesting</category>
    </item>
    <item>
      <title>Three ways that “Manual Testing” is a waste*</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 23 Jan 2024 16:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/three-ways-that-manual-testing-is-a-waste-15c2</link>
      <guid>https://forem.com/gpaciga/three-ways-that-manual-testing-is-a-waste-15c2</guid>
      <description>&lt;p&gt;* when done at the wrong time, and depending on how you define “manual testing”.&lt;/p&gt;

&lt;p&gt;Callum Akehurst-Ryan had a post recently that broke down &lt;a href="https://cakehurstryan.com/2023/09/26/is-manual-testing-a-dirty-word/"&gt;much better ways of thinking about manual testing&lt;/a&gt;, but I’m using it here as it’s used at my current company and presumably many others: scripted tests that have to be run manually to see that they pass before a feature can be considered done. But that’s not the whole context: let’s first consider how manual and automated testing relate to each other.&lt;/p&gt;

&lt;h2&gt;Context: models for when automation happens&lt;/h2&gt;

&lt;p&gt;I noticed last year, when conducting dozens of interviews for test automation specialists (we call them “QA Engineers”, you might call them SDETs), that people seemed to fall pretty cleanly into one of two camps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You have to manually test something first so that you know it works before you can automate it. Typically, people in this camp will only tackle test automation after a feature is done and “stable”, which usually means that the tests are automated one sprint after development and manual testing are done.&lt;/li&gt;
&lt;li&gt;Automation (and indeed, testing as a whole) should be treated as part of development itself, so a feature can not be considered done until the automated tests for it are implemented.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Colloquially, people refer to these as “N+1 automation” and “in-sprint automation”, respectively. It is certainly possible for the first mode to take place entirely within one sprint, but it’s usually unrealistic to simply condense the timeline that much. (People against “in-sprint” often mistakenly think that is the only difference and therefore dismiss it as impossible for that reason alone.) The real difference is in the order things occur:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L5XcfN_C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i0.wp.com/gerg.dev/wp-content/uploads/2024/01/n-plus-one.png%3Fresize%3D866%252C113%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L5XcfN_C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i0.wp.com/gerg.dev/wp-content/uploads/2024/01/n-plus-one.png%3Fresize%3D866%252C113%26ssl%3D1" alt="" width="800" height="104"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;N+1 automation: a feature is considered done after all manual tests pass, and now automated tests can be implemented without fear that the feature will change. Usually, this is done in a new sprint.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nkozfjxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i0.wp.com/gerg.dev/wp-content/uploads/2024/01/in-sprint.png%3Fresize%3D866%252C113%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nkozfjxr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i0.wp.com/gerg.dev/wp-content/uploads/2024/01/in-sprint.png%3Fresize%3D866%252C113%26ssl%3D1" alt="" width="800" height="104"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In-sprint automation: manual tests cover what automation can’t, and the feature isn’t considered done until both are complete. The steps don’t have to happen serially as shown, as long as all three are part of the definition of done.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason that the second model looks to some like doing 2 sprints worth of work in 1 is that &lt;strong&gt;&lt;em&gt;N+1 automation is inherently wasteful&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Three wastes of manually testing before automating&lt;/h2&gt;

&lt;p&gt;(You’re now free to drag me in the comments for putting the onus of waste entirely on manually testing in the post title.)&lt;/p&gt;

&lt;h3&gt;1. The automated tests will always be out of date&lt;/h3&gt;

&lt;p&gt;Consider what happens when all manual tests pass, so the feature is marked done and deployed one afternoon. That night, your automated tests run against this new code. The next morning, when you come in to review the results, what do you find?&lt;/p&gt;

&lt;p&gt;Any automated tests that depended on the way your code behaved before this new change will have failed. Your tests are out of date. Can you rely on the results of your tests to tell you anything about the health of the system now? Are you sure that the tests that passed &lt;em&gt;should&lt;/em&gt; have passed with the new code? These automated tests aren’t any good for regression anymore. They may be better than nothing, but they’re a lot harder to interpret.&lt;/p&gt;

&lt;p&gt;Since it may take a sprint of work to update all those tests based on a sprint’s worth of dev work, you can expect your tests to be broken for a full sprint on average. Then, of course, the cycle repeats because in that time a new batch of dev work is done.&lt;/p&gt;

&lt;p&gt;Congratulations, you’re now spending the rest of your life updating “broken” tests and will never expect to see your automated test suite pass in full again. I’m not exaggerating here — I’ve talked to people who literally expect 20-50% of their test suite to be failing at all times.&lt;/p&gt;

&lt;h3&gt;2. You’re doing something manually that could have been automated&lt;/h3&gt;

&lt;p&gt;Since you’re comfortable with calling the feature done after manually testing it, you must have manually tested every scenario that was important to you. These are likely going to be the first things that are worth automating, so you don’t have to manually test them again. But then, why didn’t you just automate it in the first place? If you claim that writing and running a manual test is faster than automating it (a dubious claim to begin with), how many times did you repeat that manual test during the development sprint when something changed? How much extra overhead is it going to be to figure out which manual tests have to be repeated after automation is in place and which don’t?&lt;/p&gt;

&lt;p&gt;You’re testing something with automation that you already tested manually. You’re wasting time testing it again. You’ll get the benefit of being able to re-run it or include it in regressions going forward, but you’d have had that from automating it up-front as well. You’ve only just delayed that benefit by a full sprint and gotten nothing in return.&lt;/p&gt;

&lt;h3&gt;3. Everything is more expensive&lt;/h3&gt;

&lt;p&gt;By the time you get around to automating a feature from last sprint, your developers have moved on to something else. When you inevitably need help, need some behaviour clarified, or even find a problem that manual tests missed, the context switch required for the rest of your team is much larger than if that story was still part of the current sprint. People have to drop work they committed to in the current sprint to remember what happened last sprint. Often the “completed” feature has already gone to production and the cost of fixing new issues has also gone way up. Something that might have taken a 5-minute back-and-forth with a developer to fix while the feature was in progress will now require a lot more effort. Easy but low-priority fixes will be dropped as a result, and your overall quality suffers.&lt;/p&gt;

&lt;p&gt;One answer to this may just be that you should have caught it while manually testing. And that may be true, but you didn’t, so now what? Automated and manual tests have different strengths and will always raise different bugs. The value you get from automation is now much lower because even if you find all the same bugs as if you had automated earlier, the cost to do anything about them is higher.&lt;/p&gt;

&lt;h2&gt;Automate what should be automated&lt;/h2&gt;

&lt;p&gt;In my experience, these forms of waste far outweigh any benefit you get from only automating “stable” features. Instead of updating and re-running automated tests when code changes, you’re performing the same manual tests multiple times. Instead of getting the confidence of a reliably passing test suite, your tests are always failing and require extra analysis time as a result. Instead of working &lt;em&gt;with&lt;/em&gt; developers and improving quality, automation becomes a drag &lt;em&gt;against&lt;/em&gt; the team’s dev productivity.&lt;/p&gt;

&lt;p&gt;By automating tests first, as part of the development work, manual test effort can focus on the sorts of tests that humans are good at. Automate 100% of the tests that should be automated, as Alan Page might say, rather than wasting time pretending that you’re a better robot than a robot is. You’ll get more value from &lt;em&gt;both&lt;/em&gt; your automated tests &lt;em&gt;and&lt;/em&gt; your manual efforts that way.&lt;/p&gt;

&lt;p&gt;If you’re worried that you won’t have eyes on every path if some are automated up-front, remember that most automation still requires stepping through the application and observing its behaviour while you write the code to execute those same interactions. A good automator will watch each new test and &lt;a href="https://gerg.dev/2018/10/all-your-automated-tests-should-fail/"&gt;make sure they can all fail&lt;/a&gt;, so writing automation also gives you manual (or rather, human) coverage.&lt;/p&gt;

&lt;p&gt;The change in sequence alone may not be enough to get you all the way to in-sprint automation, but you should see it as much more attainable without the extra waste holding you back.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>generaltesting</category>
      <category>agile</category>
      <category>alanpage</category>
    </item>
    <item>
      <title>The Secret Skill on Your QA Resume</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Thu, 21 Dec 2023 22:49:51 +0000</pubDate>
      <link>https://forem.com/gpaciga/the-secret-skill-on-your-qa-resume-16kb</link>
      <guid>https://forem.com/gpaciga/the-secret-skill-on-your-qa-resume-16kb</guid>
      <description>&lt;p&gt;There’s a particular skill that is on full display in your tester, QA, or quality engineer resume that you may not even realize is there: the ability to be concise.&lt;/p&gt;

&lt;p&gt;The most common resume structure I saw this year went something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A summary section with bullets describing a person’s experience and skills&lt;/li&gt;
&lt;li&gt;A table of technical skills&lt;/li&gt;
&lt;li&gt;Job experience with each job getting another set of bullet points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not an exaggeration to say that each one of those components was often a full page or more. When that happened, I would notice the same details being repeated multiple times, as if repeating a skill would emphasize it more or make it more likely to get past an automated filter. Selfishly, this is a problem because it means the information density is lower: I have to read more to get the same level of detail. But ok, I can slog through that if I have to. However, in an indirect way, it can also flag problems with your approach to testing.&lt;/p&gt;

&lt;p&gt;Since there is always an infinite amount of testing we &lt;em&gt;could&lt;/em&gt; do, but a limited time to do it in, an essential skill for a tester is knowing how to avoid repeating themselves. Testing something twice doesn’t tell you anything more than testing it once. I touched on this when describing some &lt;a href="https://gerg.dev/2023/08/bad-reasons-to-test/"&gt;bad reasons to test&lt;/a&gt; and the importance of &lt;a href="https://gerg.dev/2020/09/agile-testing-book-club-do-your-tests-have-purpose/"&gt;each test having a purpose&lt;/a&gt;. It’s also a fundamental part of the testing pyramid: the top layers can be smaller because you don’t need to repeat things you’ve tested at lower levels.&lt;/p&gt;

&lt;p&gt;This is especially true in test automation, where we have to be economical with which tests we implement &lt;em&gt;and&lt;/em&gt; with how much code we write, by using good coding practices like, among others, the “don’t repeat yourself” principle. Code written twice is more than twice as much work to maintain, since there’s overhead in remembering something was done twice in the first place.&lt;/p&gt;
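&lt;p&gt;As a quick illustration of what “don’t repeat yourself” buys you in test code (the names below are invented for the sketch, not taken from any real suite), compare a setup step pasted into every test against the same step extracted into one helper:&lt;/p&gt;

```python
# Illustrative sketch only: a fake "login" so the example runs standalone.

# Before: every test repeats the same setup inline, so a change to the
# login flow means hunting down and editing every test containing these lines.
def test_profile_duplicated():
    session = {}
    session["user"] = "testuser"   # ...the same lines in dozens of tests
    assert session["user"] == "testuser"

# After: the shared steps live in one place. A login change is now a
# one-function fix, and each test states only what it actually checks.
def login(user="testuser"):
    return {"user": user}

def test_profile():
    session = login()
    assert session["user"] == "testuser"
```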

&lt;p&gt;Now apply this to your resume: why tell me 4 times that you’ve worked with Selenium when once will do? What value does that add?&lt;/p&gt;

&lt;p&gt;Some tips that might help keep things concise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you use a table or sidebar to list a bunch of technology that you’re familiar with, you don’t need to repeat them again in full sentences under each job.&lt;/li&gt;
&lt;li&gt;If you’ve done something similar across many jobs, consider putting it into a summary section rather than repeating the same sentence every time. Use the space you have for each job to highlight what is different or unique about what you did while there.&lt;/li&gt;
&lt;li&gt;Don’t include many variations to describe the same work. Filler phrases like “Test Design”, “Test creation”, “Test Scenarios”, “Test Cases”, “Test Scripts”, and even “writing tests” all effectively mean the same thing (because they all mean something different at every company). Instead of stuffing your resume with all of them, include only the ones the job posting asks for. The same goes for other filler like “Defect Tracking”, “Defect Management”, “Defect Reporting”, “Defect retesting”, “Test Reports”, “Regression Testing”, “Sanity Testing”, “Smoke Testing”, “Functional Testing”, “System Testing”.&lt;/li&gt;
&lt;li&gt;Focus on what differentiates you from anybody else working in the same field. I called those phrases in the previous point “filler” because they mean very little and almost everybody has them. All things equal, I’m not going to make a hiring decision based on one resume including “regression testing” or not. Unless the job description mentions it, or you really do have a depth in one area that is worth highlighting, leave it out.&lt;/li&gt;
&lt;li&gt;Avoid needless detail. A sentence to tell me you “use locators to identify elements on the page” doesn’t add much if you’ve already mentioned working with Selenium. I’ve also seen multiple bullet points dedicated to “creating branches, pulling updates, pushing code, and merging code” in a single resume where just listing “git” as a skill says the same.&lt;/li&gt;
&lt;li&gt;For each bullet point you write, ask what information it adds. If all the information it adds is already found elsewhere in your resume, delete it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I notice a lot of waste like this in your resume, I am going to wonder about how efficient you will be when it comes to writing either a test plan or an automated suite. Be concise.&lt;/p&gt;

&lt;p&gt;You’re now free to get in the comments to tell me how I could have made the same point in half the space.&lt;/p&gt;

</description>
      <category>career</category>
      <category>cv</category>
      <category>dontrepeatyourself</category>
      <category>jobhunting</category>
    </item>
    <item>
      <title>Bad reasons to test</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Fri, 11 Aug 2023 01:48:57 +0000</pubDate>
      <link>https://forem.com/gpaciga/bad-reasons-to-test-3hli</link>
      <guid>https://forem.com/gpaciga/bad-reasons-to-test-3hli</guid>
      <description>&lt;p&gt;“Complete” testing is impossible, so we have to do the best we can with the time and resources we have. Often, that comes down to making sure that there’s a good reason for each test to exist. If there isn’t, then you should remove it and spend that time on something more valuable. When evaluating test suites, I will ask, “why do you have this test?” or “why are you testing this?” To me, a good reason is that the test addresses some &lt;em&gt;specific&lt;/em&gt; concern, risk, or functionality &lt;em&gt;that no other test does&lt;/em&gt;. But, I’ve heard plenty of bad reasons, too…&lt;/p&gt;

&lt;h2&gt;To cover all the scenarios&lt;/h2&gt;

&lt;p&gt;Ok, but which scenario does &lt;em&gt;this&lt;/em&gt; test cover? Do you have a list of “all” the scenarios? Is each one unique? What makes them unique? Who gave you that list? Is it ever possible to cover them all if you did have such a list?&lt;/p&gt;

&lt;h2&gt;To get more confidence&lt;/h2&gt;

&lt;p&gt;What are you more confident about after running this one test that you weren’t confident about before? Could you run another test to give you that same confidence? Why not? What are you worried about happening if we didn’t run this test?&lt;/p&gt;

&lt;h2&gt;To see that it works&lt;/h2&gt;

&lt;p&gt;To see that &lt;em&gt;what&lt;/em&gt; works? It can’t be the product or application, because that’s what all the tests do. What could be broken if we didn’t run this test?&lt;/p&gt;

&lt;h2&gt;Because the Product Owner told us to&lt;/h2&gt;

&lt;p&gt;What does the Product Owner know that you don’t? Do they have a reason? You should probably know what it is too. Maybe there’s a more effective way to test that aspect, but you have to know what aspect they’re interested in first. If they don’t have a reason, it’s probably part of your job to steer them towards more useful tests and, perhaps, illustrate why this one isn’t necessary.&lt;/p&gt;

&lt;h2&gt;Why I care&lt;/h2&gt;

&lt;p&gt;If you don’t know what a test tests, how are you going to interpret it when it fails? Tests are only valuable if they tell us something useful when they fail. Each test costs us in time to design, write, run, analyze, and maintain. The only way to know that we’re making the most of the time and resources we have is to be sure that each test adds &lt;em&gt;specific&lt;/em&gt; value. Without that, you’ll never really know which ones are worth including and which should be excluded.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>generaltesting</category>
      <category>agiletesting</category>
      <category>cost</category>
    </item>
    <item>
      <title>You might not need to fix your flakey tests</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 08 Mar 2022 15:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/you-might-not-need-to-fix-your-flakey-tests-5e8g</link>
      <guid>https://forem.com/gpaciga/you-might-not-need-to-fix-your-flakey-tests-5e8g</guid>
      <description>&lt;p&gt;There’s a toxic cycle that can be caused by flakey tests. The more unreliable tests are, the less likely you are to trust their results. At best, it means wasted time as people re-run a failing step in an automated pipeline until they get the result they want. It then takes 2 or 3 times longer to identify when a failure is “real” compared to noticing it the first time they fail. At worst, it means the test results get ignored completely. “It’s probably fine, it always fails,” someone might say. This is often an excuse to keep tests out of automated pipelines, or leave them as non-blocking. “We can’t let failing tests get in the way of releasing, because failing tests are never real.”&lt;/p&gt;

&lt;p&gt;The feedback loop comes in because as soon as you start ignoring one flakey result, it becomes easier to ignore two or three. The less you look at test results, the easier it is to let them get even flakier. As they get even flakier, you trust them even less, and the loop repeats.&lt;/p&gt;

&lt;p&gt;There are two obvious ways to break out of this cycle: you either have to fix the tests, or finally admit that they aren’t adding value and get rid of them.&lt;/p&gt;

&lt;p&gt;Often people take for granted that the &lt;em&gt;right&lt;/em&gt; thing to do is to fix the tests. People outside a team where this is a problem will ask “why doesn’t the team just fix the tests?” as if it’s an idea the team never considered. Maybe it’s a matter of time, competing priorities, or expertise. Maybe someone just has to insist on making the tests blocking and accept that they can’t release until the tests are made reliable. But that’s the obvious route to take because we default to believing that tests are good, and more tests are better.&lt;/p&gt;

&lt;p&gt;The rarer question, it seems, is: “are these tests actually useful?” It could very well be true that the one flakey test is also the only one that actually catches real bugs. This is always the excuse to default to keeping every test: “No, I can’t remember the last time this caught a real bug, but what if one day we &lt;em&gt;did&lt;/em&gt; introduce a bug here?” However, when we find ourselves so deep in the toxic feedback loop that people barely even notice when the tests fail, it’s very likely that people wouldn’t notice if those tests ever did catch a real bug anyway. &lt;em&gt;Especially&lt;/em&gt; if it’s an intermittent bug. (And sometimes a flakey test is flakey because the app is flakey, but the tests get the blame.)&lt;/p&gt;

&lt;p&gt;How do you know when it really is safe to cut your losses and delete the flakey test? One way to think of it is to remember what tests are actually for. We don’t have tests just to see them passing or failing; on one level at least, we test in order to catch issues before they make it to production. So, are the flakey tests helping that goal, or hindering it?&lt;/p&gt;
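&lt;p&gt;One rough way to frame that question is a back-of-envelope tally (the numbers below are invented purely for illustration): how often did the test’s failures point at a real bug, and what did triaging the rest cost?&lt;/p&gt;

```python
# Invented numbers, purely illustrative.
failures_last_quarter = 40    # times the flakey test failed
real_bugs_caught = 1          # failures that pointed at an actual bug
minutes_to_triage_each = 15   # average time spent checking each failure

false_alarms = failures_last_quarter - real_bugs_caught
triage_hours = failures_last_quarter * minutes_to_triage_each / 60

# 39 false alarms and 10 hours of triage to catch one bug: numbers in
# this ballpark are what make deleting the test a defensible option.
print(false_alarms, triage_hours)
```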

&lt;p&gt;I recently spoke with a group with this exact problem: automated testing wasn’t being included in their release pipelines because it was too flakey, and it was hard to put effort into fixing flakey tests because they were too easily ignored outside the pipeline. But here’s where context matters: their change fail rate was consistently well under 10%. Products had 2 or 3 production issues per year, and each was usually addressed in a few hours. For this group, that was quite good. Nobody on the team was stressed by it and their stakeholders were happy. And this was &lt;em&gt;in spite&lt;/em&gt; of the fact that big pieces of automated testing were frequently ignored.&lt;/p&gt;

&lt;p&gt;While there is always a risk that dropping tests will allow more issues to go unnoticed until it’s too late, in this scenario that isn’t going to be as much of a concern. If it’s true that the tests &lt;em&gt;already&lt;/em&gt; weren’t doing their job, with no ill effects in production, then there was little point running them in the first place, and &lt;em&gt;even less&lt;/em&gt; point in investing time to fix them. Their process was already safe enough without them.&lt;/p&gt;

&lt;p&gt;Context here matters more than anything else, but just maybe you don’t need to bother fixing those flakey tests after all. Would anybody notice if you just deleted them instead?&lt;/p&gt;

</description>
      <category>automation</category>
      <category>changefailrate</category>
      <category>flakeytests</category>
      <category>flakytests</category>
    </item>
    <item>
      <title>Coffee break tests: An exercise in test design</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 01 Mar 2022 03:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/coffee-break-tests-an-exercise-in-test-design-25a4</link>
      <guid>https://forem.com/gpaciga/coffee-break-tests-an-exercise-in-test-design-25a4</guid>
      <description>&lt;p&gt;Let’s think about an example test scenario.&lt;/p&gt;

&lt;p&gt;We’re given a program called MoverBot that is responsible for distributing a data file created by another program, OutputBot, to several servers. Since the two programs operate asynchronously with variable runtimes, OutputBot and MoverBot communicate about which files are ready to be distributed through a shared database entry. OutputBot adds the new file to the database when it is ready to be distributed, and MoverBot marks when that file is done.&lt;/p&gt;

&lt;p&gt;The “happy path” might be written like so:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given&lt;/em&gt;&lt;/strong&gt; a file to be distributed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; an entry in the database,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;When&lt;/em&gt;&lt;/strong&gt; MoverBot runs,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Then&lt;/em&gt;&lt;/strong&gt; the file is copied to each server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, we should also consider the case where MoverBot shouldn’t distribute the file:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given&lt;/em&gt;&lt;/strong&gt; a file to be distributed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; there is not an entry in the database,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;When&lt;/em&gt;&lt;/strong&gt; MoverBot runs,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Then&lt;/em&gt;&lt;/strong&gt; the file is not copied to any server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Can you spot the problem with that test?&lt;/p&gt;

&lt;p&gt;Early in my testing career I included an equivalent test in a test plan for something a lot like MoverBot, but luckily it didn’t survive peer review. The problem is that instead of running the test, you could go for a coffee break and come back to the same result. Like so:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given&lt;/em&gt;&lt;/strong&gt; a file to be distributed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; there is not an entry in the database,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;When&lt;/em&gt;&lt;/strong&gt; I go for a coffee break and come back without running anything,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Then&lt;/em&gt;&lt;/strong&gt; the file is not copied to any server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words: given any scenario at all, when I do nothing, nothing happens.&lt;/p&gt;

&lt;p&gt;To improve this, we need to ask: what bug (or risk) is this test trying to catch (or mitigate)? Once we answer that, we can do a better job of designing the test. These “negative tests”, if you want to call them that, aren’t just for exercising some behaviour. They need to demonstrate that some erroneous behaviour doesn’t exist.&lt;/p&gt;

&lt;p&gt;In this case, what I wanted to do was show that the database entry actually does control whether the file is distributed or not. I needed a control to show that MoverBot &lt;em&gt;would&lt;/em&gt; have distributed the file without the database entry given a chance. In other words, I needed to show that MoverBot was perfectly capable of distributing files, it just opted not to for this case. This is the revised test:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given&lt;/em&gt;&lt;/strong&gt; two files to be distributed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; only one has an entry in the database,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;When&lt;/em&gt;&lt;/strong&gt; MoverBot runs,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Then&lt;/em&gt;&lt;/strong&gt; only the file with the database entry is copied to any server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We know now that if MoverBot wasn’t reading the database entry correctly it probably would have happily distributed both files. Since it only distributed one file, it must be respecting the database entry.&lt;/p&gt;
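&lt;p&gt;If you were automating this, the control fits naturally into the same test. Here is a minimal, self-contained sketch in Python. To be clear, &lt;code&gt;mover_bot&lt;/code&gt; and the in-memory “servers” are toy stand-ins invented for illustration, not the real program’s interfaces:&lt;/p&gt;

```python
def mover_bot(files, db_entries, servers):
    # Toy stand-in for MoverBot: copy each file that has a database
    # entry to every server. All names here are illustrative.
    for name in files:
        if name in db_entries:
            for server in servers:
                server.add(name)

def test_only_file_with_db_entry_is_distributed():
    # Given two files to be distributed, and only one has a database entry
    files = ["tracked.dat", "control.dat"]
    db_entries = {"tracked.dat"}
    servers = [set(), set(), set()]

    # When MoverBot runs
    mover_bot(files, db_entries, servers)

    # Then only the file with the database entry is copied. The control
    # file proves MoverBot was able to copy files and simply chose not to.
    for server in servers:
        assert "tracked.dat" in server
        assert "control.dat" not in server

test_only_file_with_db_entry_is_distributed()
```

&lt;p&gt;If the copy logic broke entirely, the first assertion fails; if the database check broke, the second one does. Either way, the test cannot pass by doing nothing.&lt;/p&gt;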

&lt;p&gt;This particular example is equivalent to running both the first “happy case” and the “coffee break test” at the same time. One might argue that given the happy case is there and presumably passes, you don’t need the control in the negative test. However, that is only true if both tests always run together and you can guarantee there are no bugs in how the second test is set up. Remember, automated tests are code, and all code can have bugs!&lt;/p&gt;

&lt;p&gt;Including controls directly in your tests is a way of proving, in a self-contained way, that the test setup is correct and that you would have seen the bug manifest if it existed. They let you say with confidence that the only reason nothing happened is that nothing was supposed to happen. Controls aren’t always necessary, but if going for a coffee break would give you the same result, it’s a smart move to include them.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>controls</category>
      <category>design</category>
      <category>negativetests</category>
    </item>
    <item>
      <title>Testing in 2021, according to my Twitter bookmarks</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 11 Jan 2022 15:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/testing-in-2021-according-to-my-twitter-bookmarks-3k2p</link>
      <guid>https://forem.com/gpaciga/testing-in-2021-according-to-my-twitter-bookmarks-3k2p</guid>
      <description>&lt;p&gt;My annual roundup of all the things about testing (and working as a programmer more generally) on Twitter that I found interesting enough to bookmark.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“&lt;a href="https://twitter.com/sindokolenchery/status/1346294901965639680?s=20"&gt;A picture is worth a thousand assertions&lt;/a&gt;”, quoting Angie Jones from the &lt;a href="https://testandcode.com/141"&gt;Test And Code podcast&lt;/a&gt; (but remember that all those assertions happen at once without being itemized)&lt;/li&gt;
&lt;li&gt;Angie Jones also helped break down &lt;a href="https://applitools.com/blog/selenium-vs-cypress-the-rematch/?utm_campaign=%5BBlog%5D-Selenium-vs-Cypress:-The-Rematch&amp;amp;utm_medium=Organic-Social&amp;amp;utm_source=Twitter&amp;amp;utm_term=CAT&amp;amp;utm_content=Blog"&gt;the last two years of development in WebDriver and Cypress&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;John Cutler identified &lt;a href="https://twitter.com/johncutlefish/status/1346542915439345664?s=20"&gt;an overlap between observability and product analytics&lt;/a&gt;, which also works for testing.&lt;/li&gt;
&lt;li&gt;Ben Sigelman pointed out &lt;a href="https://twitter.com/el_bhs/status/1349406398388400128?s=20"&gt;the difference between observability and monitoring&lt;/a&gt;, and how they complement each other.&lt;/li&gt;
&lt;li&gt;Some &lt;a href="https://www.freecodecamp.org/news/what-google-taught-me-about-code-reviews/"&gt;tips from Google on doing code reviews&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/dev_nikema"&gt;This thread started by Nikema Prophet&lt;/a&gt; has a lot of &lt;a href="https://twitter.com/dev_nikema/status/1350233517137907713"&gt;great books recommendations on writing code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Trish Khoo called the idea that testers have a mindset different from developers “&lt;a href="https://twitter.com/hogfish/status/1351692222421377025?s=20"&gt;one of the dumbest myths I’ve ever heard&lt;/a&gt;”, eleven months before &lt;a href="https://gerg.dev/2021/12/gatekeeping-in-testing/"&gt;I also wrote about it&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;This quote from W. Edwards Deming struck me as important for my work in metrics, and is still something I need to dig into more: “&lt;a href="https://twitter.com/johncutlefish/status/1352827946386382850?s=20"&gt;It is wrong to suppose that if you can’t measure it, you can’t manage it&lt;/a&gt;”&lt;/li&gt;
&lt;li&gt;Speaking of metrics, Theresa Neate raised an interesting question: &lt;a href="https://twitter.com/TheresaNeate/status/1424316848359690247?s=20"&gt;what are “vanity” testing metrics vs actually meaningful metrics?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;I didn’t get a chance to participate myself, but I loved this idea from Ministry of Testing to run &lt;a href="https://twitter.com/ministryoftest/status/1356183518737522695"&gt;test automation challenges&lt;/a&gt; and have us compare solutions.&lt;/li&gt;
&lt;li&gt;Amy Tobey suggested &lt;a href="https://twitter.com/MissAmyTobey/status/1367894771339907073"&gt;renaming “machine learning” to “automated bias”&lt;/a&gt; and I am on board.&lt;/li&gt;
&lt;li&gt;Nat Alison emphasized &lt;a href="https://twitter.com/tesseralis/status/1374191166061637637?s=20"&gt;how important good documentation is for engineer productivity&lt;/a&gt;, likening it to memoization (i.e., storing the result of an expensive function call).&lt;/li&gt;
&lt;li&gt;Charity Majors shared some &lt;a href="https://twitter.com/mipsytipsy/status/1377845930490126336?s=20"&gt;awesome slides about why software should auto-deploy in 15 minutes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;This was a great roundup of &lt;a href="https://applitools.com/blog/unpopular-opinions-software-testing-edition/?utm_campaign=%5BBlog-Post%5D-Unpopular-Opinion:-Software-Testing-Edition&amp;amp;utm_medium=Organic-Social&amp;amp;utm_source=Twitter&amp;amp;utm_term=angie-social&amp;amp;utm_content=Blog"&gt;unpopular opinions about testing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/RobMcCargow/status/1381645553843499008?s=20"&gt;Rob McCargow suggested&lt;/a&gt; reading “9 Rules for Humans in the Age of Automation” by Kevin Roose.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/b0rk/status/1382383203999219716?s=20"&gt;Julia Evans&lt;/a&gt;, known (by me at least) for &lt;a href="https://wizardzines.com/"&gt;her great zines on programming&lt;/a&gt;, put together a neat &lt;a href="https://computer-mysteries.netlify.app/slow-website.html"&gt;choose-your-own-adventure game about debugging a computer networking problem&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We’ve all seen photos showing examples of products that look like unit tests worked but the integration tests failed. Johanna South posted a great example of the inverse: &lt;a href="https://twitter.com/theQAconnection/status/1441128181914746882"&gt;when integration passes even though unit tests failed&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I found these &lt;a href="https://twitter.com/GeePawHill/status/1442524237470728193?s=20"&gt;tips on coaching&lt;/a&gt; helpful at a time when I was struggling to have an impact.&lt;/li&gt;
&lt;li&gt;Some &lt;a href="https://twitter.com/johncutlefish/status/1463240570990518273?s=20"&gt;&lt;em&gt;real&lt;/em&gt; perks of a good employer&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Some great &lt;a href="https://twitter.com/johncutlefish/status/1414432716242649088?s=20"&gt;prompts for articulating hypotheses and research questions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I can’t find an original source, but &lt;a href="https://twitter.com/i_anic/status/1465722897524269070"&gt;Prince Charles and Ozzy Osbourne showed up as a fun warning about personas&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;This is still a work in progress, but I’m very interested to see what Richard Bradshaw and Mark Winteringham will come up with for their &lt;a href="https://twitter.com/FriendlyTester/status/1478800993525673986?s=20"&gt;proposed test automation curriculum&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m using twitter less compared to previous years, but you can still find my similar summaries from &lt;a href="https://gerg.dev/2019/12/testing-2019-on-twitter/"&gt;2019&lt;/a&gt; and &lt;a href="https://gerg.dev/2020/12/testing-in-2020-according-to-my-twitter-bookmarks/"&gt;2020&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>twitter</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Gatekeeping in testing</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 21 Dec 2021 15:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/gatekeeping-in-testing-ijj</link>
      <guid>https://forem.com/gpaciga/gatekeeping-in-testing-ijj</guid>
      <description>&lt;p&gt;We often talk about gatekeeping in testing as a problem in the sense that testers shouldn’t be the ones who decide when something goes out to production. But “gatekeeping” can also be used in the sense of excluding others. In fan communities you might hear “you aren’t a &lt;em&gt;real&lt;/em&gt; Marvel fan if you’ve only seen the movies”. In software dev, a common example is “you aren’t a &lt;em&gt;real&lt;/em&gt; web developer if you only know vanilla Javascript” or “you aren’t a &lt;em&gt;real&lt;/em&gt; developer if your github commit history isn’t all green.”&lt;/p&gt;

&lt;p&gt;I’ve become somewhat jaded with the testing discourse online, in part because I think testers, as a community, tend to be guilty of this as well, albeit less explicitly. The idea for this post first came to mind some time ago, but it wasn’t until I saw a few recent viral moments on twitter about gatekeeping developers that I made the connection.&lt;/p&gt;

&lt;p&gt;Probably the most obvious example is the consistent myth that there is a “tester’s mindset” that precludes developers from being able to test their own code. This is what enables a “gatekeeper to production” role to develop in the first place, and creates bottlenecks as multiple developers funnel every ticket through limited testing people on a team.&lt;/p&gt;

&lt;p&gt;Plenty of ink has already been spilt on why this false dichotomy is detrimental. But, we do it in more subtle ways as well.&lt;/p&gt;

&lt;p&gt;Testers of all stripes, famously and annoyingly, love to debate terminology. Take the “testing vs checking” question. Some well-known testers feel very strongly that these need to be differentiated, and that it is detrimental to the craft of testing if we aren’t careful about which we use. I understand the historic contexts that led to a need for that distinction (for one, it was a reaction against “let’s just automate everything”). But never once have I seen the value of making that distinction when talking to a developer on a project. The only people who do are testers talking to other testers about testing, and failing to make that distinction can be flagged as a sign that you aren’t a &lt;em&gt;real&lt;/em&gt; tester.&lt;/p&gt;

&lt;p&gt;In that same category I would add much of tester jargon and debates: “oracles”, “heuristics”, whether a tool is a test tool or not, testing vs QA (“but you can’t &lt;em&gt;assure&lt;/em&gt; anything”, the &lt;em&gt;real&lt;/em&gt; testers say). I once saw a conference speaker refer to the work of “Jerry” (no last name) several times before someone in the crowd had to stop them and ask “who’s Jerry?” (Are you a &lt;em&gt;real&lt;/em&gt; tester if you haven’t read any &lt;a href="https://en.wikipedia.org/wiki/Gerald_Weinberg"&gt;Gerald Weinberg&lt;/a&gt;?)&lt;/p&gt;

&lt;p&gt;Again, these are subtle definitions and debates that can be interesting to testers talking to testers about testing. We run into problems, however, when we expect that every person who tests should already know, understand, or care about them. They are almost never interesting to a product team talking about testing an application.&lt;/p&gt;

&lt;p&gt;So, yes: put all of those on the list of things I no longer care about as a tester. &lt;a href="https://gerg.dev/2020/07/lumpers-and-splitters-in-testing/"&gt;I’ve written about that before&lt;/a&gt;. More importantly, put them on the list of things I don’t expect anybody else to care about either. They’re not prerequisites to being a good tester nor being able to test. Plenty of testers do good work outside the twitter-sphere and conference circuit. We should not be using these inside-baseball details as &lt;a href="http://www.ruf.rice.edu/~kemmer/Words/shibboleth"&gt;shibboleths&lt;/a&gt; to filter out the real testers and exclude those who would be allies.&lt;/p&gt;

</description>
      <category>culture</category>
      <category>terminology</category>
      <category>gatekeeping</category>
      <category>geraldweinberg</category>
    </item>
    <item>
      <title>Going deeper on “Should we automate each negative test?”</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 02 Nov 2021 14:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/going-deeper-on-should-we-automate-each-negative-test-2i9h</link>
      <guid>https://forem.com/gpaciga/going-deeper-on-should-we-automate-each-negative-test-2i9h</guid>
      <description>&lt;p&gt;In a recent article on the Ministry of Testing site, Mark Winteringham asks: “&lt;a href="https://www.ministryoftesting.com/dojo/lessons/should-you-create-automation-for-each-negative-api-scenario"&gt;Should You Create Automation For Each Negative API Scenario?&lt;/a&gt;” In short, he answers that which scenarios you automate will depend entirely on what risks you’re trying to mitigate. While I’m on board with the idea that each test should have a reason behind it, I would have tackled the question differently, because I think there’s a more interesting question lurking beneath the surface.&lt;/p&gt;

&lt;p&gt;Let’s use the same example: an API that validates an email address by responding either with &lt;code&gt;200 OK&lt;/code&gt; or &lt;code&gt;400 Bad Request&lt;/code&gt;. In this context, a “positive” test scenario says that a valid email will return a &lt;code&gt;200 OK&lt;/code&gt; response. A negative scenario would say that an invalid email should get a &lt;code&gt;400 Bad Request&lt;/code&gt; response. Now we can break the question down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you create automation for &lt;em&gt;one&lt;/em&gt; negative API scenario?&lt;/strong&gt; The answer to this is unequivocally yes, absolutely. Without at least one case of seeing that &lt;code&gt;400&lt;/code&gt; response, your API could be returning &lt;code&gt;200 OK&lt;/code&gt; for all requests regardless of their content. Along the same lines as my claim that &lt;a href="https://gerg.dev/2018/10/all-your-automated-tests-should-fail/"&gt;all tests should fail&lt;/a&gt;, one test doesn’t tell you much unless it shows a behaviour &lt;em&gt;in contrast&lt;/em&gt; to some other possible behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you create automation for &lt;em&gt;all&lt;/em&gt; negative API scenarios?&lt;/strong&gt; The answer to this should also obviously be “no”, for the same reason that you can’t automate all positive scenarios. There are infinitely many of them.&lt;/p&gt;

&lt;p&gt;Now, &lt;strong&gt;should you create automation for &lt;em&gt;each&lt;/em&gt; negative API scenario?&lt;/strong&gt; I’m not sure whether this is any different from the previous question, except for the fact that “each” implies (to me) being able to iterate through a finite list. This question actually can’t be answered as asked because, as Mark points out, there are infinite ways that an email could be considered invalid.&lt;/p&gt;

&lt;p&gt;The more interesting question I would pose instead is: &lt;strong&gt;&lt;em&gt;Which&lt;/em&gt; negative scenarios should we automate?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, this still depends on what risks you’re interested in, but even when addressing a single risk we can add a bit more information. It is safe to assume in this example that an invalid address getting into the system has some negative effect associated with it, otherwise this validation API wouldn’t exist at all. But even for that single goal of admitting only valid emails, there are still infinitely many ways to test it.&lt;/p&gt;

&lt;p&gt;At the risk of being too prescriptive, something we can likely do is break down the behaviour into two pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the API return &lt;code&gt;400 Bad Request&lt;/code&gt; when the email fails validation?&lt;/li&gt;
&lt;li&gt;Does the email validation function fail for all invalid email addresses?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first question is now much simpler. Using one example of an email that we know fails validation should be sufficient to answer it. We’ve essentially reduced an infinite number of possible negative test cases into a single equivalence partition; i.e. a group of negative test cases from which any &lt;em&gt;one&lt;/em&gt; is capable of answering our question. If you like formal math-y lingo, you might call this “reducing the cardinality” or “normalizing” the set.&lt;/p&gt;

&lt;p&gt;The second question now says nothing about the API at all and we can hopefully tackle it as unit tests. This does stray a bit into greybox testing, but I’m not against that sort of thing if you aren’t.&lt;/p&gt;
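&lt;p&gt;Here is a sketch of what that split can look like in code. Everything in it is hypothetical: the article doesn’t show the real API, so &lt;code&gt;validate_email&lt;/code&gt;, &lt;code&gt;handle_request&lt;/code&gt;, and the regex are invented stand-ins, and the regex is a deliberately simplistic definition of “valid”:&lt;/p&gt;

```python
import re

def validate_email(email):
    # Question 2 lives here: is this address valid? This function is
    # unit-testable on its own, with no API involved.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

def handle_request(email):
    # Question 1 lives here: does a failed validation map to a 400 response?
    return 200 if validate_email(email) else 400

# One representative from each equivalence partition answers question 1:
assert handle_request("user@example.com") == 200
assert handle_request("not-an-email") == 400

# Question 2 gets its own, larger set of cases, one per distinct way of
# being "invalid" that we care about:
for bad in ["", "no-at-sign", "two@@ats.com", "user@", "@example.com"]:
    assert validate_email(bad) is False
```

&lt;p&gt;The API-level check needs only the two partition representatives; the list of invalid examples belongs to the unit-level question and can grow without touching the API tests at all.&lt;/p&gt;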

&lt;p&gt;Our job isn’t done, though. This still should raise some questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Are there other ways the API could fail?&lt;/li&gt;
&lt;li&gt;Which negative scenarios should we automate at the unit level?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s take each in turn.&lt;/p&gt;

&lt;p&gt;For the first, even if the answer is yes, we should try the same trick of reducing many variations of errors into distinct modes or equivalence partitions. You might be interested in other boundaries as well. An empty email address, or no email field at all, are both distinct from the case of an invalid email. If there are other response codes that the API could return, like &lt;code&gt;401 Unauthorized&lt;/code&gt; or &lt;code&gt;404 Not Found&lt;/code&gt;, there’s a good chance you’ll want one case for each of those. You’re unlikely to need more than that, though, unless there are multiple distinct reasons for returning the same response. You could get deep into the intricacies like invalid JSON or changing the HTTP headers, but at that point you definitely need to ask if those are risks you’re worried about enough to put time into.&lt;/p&gt;

&lt;p&gt;Now the second question. You’ve probably already caught that this is the same as the “interesting question” we started with. We’ve just bumped it down to a lower level of testing. At least that means we’ve made each example cheaper to test, though there are still infinitely many.&lt;/p&gt;

&lt;p&gt;No matter what subset of the infinitely many invalid emails you pick to test, you should still be able to articulate why each version of “invalid” is different from each other version. Can you tie each one back to specific product-level risks? Probably not, in all practicality. At a product level, invalid is invalid. I doubt you’ll be able to get anybody to say that testing one invalid email mitigates a different risk than any other invalid email. Unfortunately, it doesn’t follow that you only need to test one, because it is still true that whatever method your product uses to validate email addresses could be flawed.&lt;/p&gt;

&lt;p&gt;Remember that ideally we are working at the unit level here, so hopefully you agree that it is fair to go into whitebox testing at this point. At worst, you might end up reading &lt;a href="https://www.rfc-editor.org/rfc/rfc5322"&gt;RFC 5322&lt;/a&gt; until you go cross-eyed trying to identify what actually makes an email valid or not. If you did, you could devise one negative example from each of the specifications in that RFC. More likely, you will find that either the product is using a much simpler definition of “valid” than the actual email specification, or it is using a 3rd party library.&lt;/p&gt;

&lt;p&gt;In the former case, your product team has to accept the risk of rejecting potentially valid addresses, but at least understanding the product’s definition of “invalid” will define a much narrower set of test cases. Each negative test you use should map directly to your product’s definition. You could have fun coming up with valid emails (according to RFC 5322) that your product calls invalid, or vice versa—I once had to do exactly this as a way of needling our product team into improving our home-grown definition of a valid domain name, which has its own similarly complicated RFC. If your product’s definition is changed to account for those counterexamples, they are good candidates to retain in your tests. If not, it can be helpful to keep them as examples of how your spec knowingly diverges from the official spec, but make sure each test is explicit about whether it covers a feature or a known bug. Future generations may look at specs like that and wonder whether there was a reason you had to accept invalid (or reject valid) emails, or if it was just a case of cutting corners in your validation. That is, they need to be able to know if it is safe to change those kinds of behaviours. (“Should you write tests for known bugs?” is a good topic for a separate discussion.)&lt;/p&gt;

&lt;p&gt;In the latter case—using a third party library—you’re likely not interested in testing its internals too deeply. Your scope of testing is now defined by “how much do we trust (or distrust) this library?” If the answer is “not at all”, then you’re back to the RFC and trying to violate each example in turn. If the answer is “completely!” then you likely don’t need any more than a few broad examples (as long as they are all different from each other). One technique that sometimes works is to come up with the most outrageous input possible so you can say “if it knows this is valid, it probably understands much more normal input too”. The trick is being deliberate about your choice of “outrageous”.&lt;/p&gt;

&lt;p&gt;Finally, if you’re one of the unfortunate people who doesn’t have the option of moving scenarios down into unit tests, you’ll still have to answer these same questions at the API level anyway. My advice would still be to have one test as the canonical answer to “does an invalid email get rejected”, and have a separate group of tests that are explicitly labelled as testing what it means to be an “invalid email”. Then the reason for having each set is, at least, explicit. You can still test changes to the definition of “valid” separately from changes to the API’s reaction to it.&lt;/p&gt;

&lt;p&gt;I recognize that getting into the weeds of email’s RFC specifications is not likely what Mark intended with this specific example, but I think the lessons here still carry over to other features that don’t have public standards behind them. You can’t test &lt;em&gt;each&lt;/em&gt; negative test case. You &lt;em&gt;can&lt;/em&gt; limit the scope of what “negative” means based on the level of testing you’re in. You &lt;em&gt;can&lt;/em&gt; keep testing at higher levels simple by building on the tests at lower levels. And, you &lt;em&gt;can&lt;/em&gt; reduce infinite negative examples to distinct classes to test one example of each.&lt;/p&gt;

&lt;p&gt;(Re-reading this later, I realize there might also be some subtle terminology things that change the question: what is the difference between a “test”, a “test case”, and a “scenario”, for example? I know testers love bickering about terminology but I tend to &lt;a href="https://gerg.dev/2020/07/lumpers-and-splitters-in-testing/"&gt;lump things together&lt;/a&gt;. If your preferred definitions would change the above, feel free to mentally substitute in whichever words you think I’m actually talking about.)&lt;/p&gt;

</description>
      <category>automation</category>
      <category>apitesting</category>
      <category>equivalencepartition</category>
      <category>markwinteringham</category>
    </item>
    <item>
      <title>The Gambler and other fallacies in Testing</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 26 Oct 2021 14:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/the-gambler-and-other-fallacies-in-testing-8p1</link>
      <guid>https://forem.com/gpaciga/the-gambler-and-other-fallacies-in-testing-8p1</guid>
      <description>&lt;p&gt;I just listened to &lt;a href="https://soundcloud.com/ministryoftesting/testsphere-roulette-episode-3-david-maynard-and-emna-ayadi?in=ministryoftesting/sets/testsphere-roulette"&gt;Episode 3 of the Ministry of Testing’s TestSphere Roulette podcast series&lt;/a&gt;, and something about the conversation irked me. The discussion was centered on the Gambler’s Fallacy card, which says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The human tendency to perceive meaningful patterns within random data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specifically, it usually refers to a gambler playing a game of chance who might think that past results can tell him something about what is likely to come up next. In a game of roulette, after seeing a string of red, we might be tempted to think that black is “due” to come up next. Or, possibly, that red is on “a streak”, and therefore more likely to come up again on the next spin.&lt;/p&gt;

&lt;p&gt;The examples on the TestSphere card, though, describe what I think are quite different scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating tests for every bug as they’re found, so in a few years people wonder why there are tests for such obscure things.&lt;/li&gt;
&lt;li&gt;Repeatedly going back to test the same things that have broken in the past.&lt;/li&gt;
&lt;li&gt;A very small portion of your user base being very loud in app store reviews.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The conversation on the podcast focused on these three examples and how people had experienced them. It wasn’t until I pulled out the Gambler card myself and read through it again that I realized what bugged me. There was nothing wrong with what anybody said. The problem I have is that none of those examples on the card are examples of the Gambler’s fallacy at play, because &lt;strong&gt;&lt;em&gt;bugs aren’t random data&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I suspect a lot of us have experienced some flavour of the Pareto principle in testing. It usually goes something like this: 80% of the bugs are caused by 20% of the code. I work on a very large web app and I would say most bugs by far come from either CSS visual layout or mishandling of malformed data coming from one of the APIs. If bugs arose randomly, it would be a case of the Gambler’s fallacy to believe that past CSS bugs are a predictor of more future CSS bugs. The rational belief would be that CSS bugs should arise in proportion to how much of the app is CSS. But my experience as a tester tells me that isn’t true. In fact, there’s a different TestSphere card—the History heuristic—that reflects this: “Previous versions of your product can teach you where problematic features are or where bugs are likely to turn up.” (I’m actually surprised there’s not an orange Patterns card in TestSphere for Pareto.)&lt;/p&gt;

&lt;p&gt;This argument also applies to the user reviews example: a lot of angry reviews from a small portion of the users might be because they’re all about a significant bug that only affects that one segment of users. The Gambler’s fallacy warns that if reviews are random data, then a string of reviews from one small segment does not make the next review any more or less likely to come from that same segment. But the reverse is probably true here: a string of reviews from one small segment of your users suggests that there might be a correlation, and you should expect more reviews from that same segment in the future unless you change something.&lt;/p&gt;

&lt;p&gt;(Sidenote: The Gambler’s fallacy sometimes doesn’t even apply that well to gambling for the same reason. There’s an interesting &lt;a href="https://www.youtube.com/watch?v=ytuHV2e4c4Q"&gt;Mathologer video that walks through the math of seeing 60 red roulette spins out of 100&lt;/a&gt;. Despite what the Gambler’s fallacy might suggest, you actually should bet on red for spin 101 because it’s likely that the wheel isn’t actually random, i.e. it has a bias towards red. This is also why counting cards works.)&lt;/p&gt;
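&lt;p&gt;You can sanity-check one piece of that argument yourself. The snippet below assumes a European wheel (18 red pockets out of 37) and independent spins, and computes the exact binomial probability of a fair wheel showing 60 or more reds in 100 spins:&lt;/p&gt;

```python
from math import comb

p_red = 18 / 37  # assumption: European wheel, 18 of 37 pockets are red
n, k = 100, 60

# Exact probability that a FAIR wheel shows 60 or more reds in 100 spins
p_tail = sum(comb(n, i) * p_red**i * (1 - p_red)**(n - i)
             for i in range(k, n + 1))

print(f"chance of 60+ reds on a fair wheel: {p_tail:.4f}")
```

&lt;p&gt;The result comes out on the order of one in a hundred: rare enough that a biased wheel starts to look like a plausible alternative explanation for the streak.&lt;/p&gt;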

&lt;p&gt;Of course, all of this raises a question: how does the Gambler’s fallacy apply to testing? In order for the Gambler’s fallacy to really apply, we need to be looking at something with random data. And aside from a few specific cases, like where your product is actually dealing with randomness, it’s hard to see scenarios where this comes up day to day. At a stretch, we might be able to say that sufficiently rare events are &lt;em&gt;as good as&lt;/em&gt; random. For example, one request out of a billion failing in some weird unexpected mode. Even cosmic rays can cause one-off misbehaviours! An unfortunate string of cosmic bit-flips should not necessarily be taken as evidence that your product is exceptionally prone to them. This might be a case of the &lt;a href="https://en.wikipedia.org/wiki/Accident_(fallacy)"&gt;accident fallacy&lt;/a&gt;: misbehaviours are usually caused by bugs, and cosmic rays cause misbehaviours, therefore cosmic rays are bugs. But then we’re getting into a tangent about risk tolerance and probabilities.&lt;/p&gt;

&lt;p&gt;What the TestSphere card (and the podcast) was instead suggesting was that a bug occurring once shouldn’t be taken as evidence that it will occur again. I think Gambler’s fallacy is the wrong label to put on that idea, but it is worth considering on its own. I certainly agreed with the speakers on the podcast that it is important to prune our test suites, and regularly review whether the tests they contain are valuable. However, I don’t think you can extend that to the black-and-white assertion that tests for old bugs are unnecessary. It is difficult even to say “this bug is now impossible and thus shouldn’t be tested for”, since the implementation could change to make it possible again. How likely is that? As usual, the answer is &lt;a href="https://gerg.dev/2019/02/the-most-important-and-most-useless-response-to-every-question-about-testing/"&gt;annoyingly context dependent&lt;/a&gt;. Deciding which tests are worth doing given finite time is one of the great arts of testing. Likely 90% of automated tests written will never catch a bug, because it is impossible to know in advance which bugs will happen again and which won’t. But it doesn’t follow that it is good strategy to reduce your test suite by a factor of ten. Nor does it follow that you shouldn’t add a test for a bug you’ve seen today.&lt;/p&gt;

&lt;p&gt;There can also be a belief that if a test exists, it &lt;em&gt;should&lt;/em&gt; exist. This is the &lt;a href="https://en.wikipedia.org/wiki/Is%E2%80%93ought_problem"&gt;is-ought fallacy&lt;/a&gt;. In trying to justify it, you might run afoul of the &lt;a href="https://en.wikipedia.org/wiki/Historian%27s_fallacy"&gt;Historian’s fallacy&lt;/a&gt; by thinking that if someone wrote that test in the past, they must have had a good reason for doing so, and that good reason is reason to keep it. In reality, we may have more information about our product today than the testers of the past had, so we might come to a different conclusion. I’ve also seen people use the &lt;a href="https://en.wikipedia.org/wiki/Appeal_to_tradition"&gt;Appeal to Tradition&lt;/a&gt; fallacy – “we test it that way because it’s always been tested that way”. That’s not much of a test strategy either, I dare say.&lt;/p&gt;

&lt;p&gt;At the end of the day, what I really got out of this whole discussion is that it’s great fun to read through Wikipedia’s &lt;a href="https://en.wikipedia.org/wiki/List_of_fallacies"&gt;List of Fallacies&lt;/a&gt; and think about all the ways in which they are misused in testing. Our job is often justifying what tests are worth doing, whether we think about it explicitly or not. It’s worth being able to recognize when our logic leaves something to be desired.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>fallacies</category>
    </item>
    <item>
      <title>Three ways to make metrics suck less</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Wed, 22 Sep 2021 06:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/three-ways-to-make-metrics-suck-less-4ac</link>
      <guid>https://forem.com/gpaciga/three-ways-to-make-metrics-suck-less-4ac</guid>
      <description>&lt;p&gt;Everybody loves to hate metrics. I get it. There are a lot of terrible metrics out there in the software development and testing world. People still propose counting commits or test cases as a measure of productivity. It’s garbage. But I also believe that measuring something can be a useful way to understand aspects of it that you couldn’t get with qualitative measures alone. I’m not going to give a defence of metrics in all cases here, but I do have a few suggestions for how to make them suck less.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Be very explicit about what a metric measures
&lt;/h2&gt;

&lt;p&gt;Take the example of counting the number of commits a developer makes. It’s a terrible metric because commits aren’t actually a measure of productivity. While the platonic ideal of a commit is that it represents a single atomic change, the work behind one could be anything from a single-character change to a large refactor of a highly coupled codebase.&lt;/p&gt;

&lt;p&gt;The number of test cases run and the number of bugs found are equally bad metrics. Neither has an unambiguous way to be counted. Test cases can be broken up in all kinds of arbitrary ways to change their number. Meanwhile, a single root cause might be reported as 8 different bugs across 3 application layers, either just because that’s how it manifested or because someone is incentivized to find lots of bugs.&lt;/p&gt;

&lt;p&gt;There’s a very academic but &lt;a href="http://www.kaner.com/pdfs/metrics2004.pdf"&gt;interesting paper by Kaner &amp;amp; Bond&lt;/a&gt; all about rigorously asking what metrics actually measure. They propose a series of questions to help define metrics in a way that makes sure what you’re trying to measure is explicit. In my reading, the most important aspects of it boil down to making sure you have solid answers to the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why are you measuring this? (If you don’t have a good answer, stop here.)&lt;/li&gt;
&lt;li&gt;What attribute are you actually trying to understand? (e.g., productivity)&lt;/li&gt;
&lt;li&gt;Is the metric you’re proposing actually correlated with that attribute? If the metric goes up, does that mean the attribute improved, and vice versa?&lt;/li&gt;
&lt;li&gt;What assumptions are you making about how these two things are related?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, you want to be very explicit about why you’re looking at any particular metric, because it is very easy to track a metric that doesn’t measure what you think it does.&lt;/p&gt;

&lt;p&gt;Another important question that Kaner &amp;amp; Bond bring up is: what are the side effects of measuring this? That leads us to the next important piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Have a counter-metric
&lt;/h2&gt;

&lt;p&gt;One of the most common complaints about metrics is that they can always be “gamed”, or manipulated. If I start counting how many bugs people log, they’ll find ways to log more bugs. If I count commits, developers will make smaller commits. At its best, a metric will motivate positive changes, but it can always be taken too far. &lt;a href="https://en.wikipedia.org/wiki/Goodhart%27s_law"&gt;Goodhart’s law&lt;/a&gt; warns us that as soon as a metric becomes a target, it ceases to be a good measure.&lt;/p&gt;

&lt;p&gt;If we’ve carefully thought through the side effects of making a measurement, we should know how things might go wrong. While culture plays a big role here — e.g. by making very clear that a new metric is not a measure of personal performance, actually meaning it, and having people believe you — we can be more systematic about preventing this.&lt;/p&gt;

&lt;p&gt;A classic example of counter-metrics from DevOps is that in pursuing more frequent small releases, a team might cut corners in testing. Less testing means you can release faster, but you could also see the quality of the product decrease by releasing bugs more often. This is why the “Big 4” DevOps metrics have two related to speed (release frequency and lead time) &lt;em&gt;and&lt;/em&gt; two related to stability (how many production issues there are and how long it takes to recover from them). The stability metrics are there to make sure people don’t privilege speed at all costs.&lt;/p&gt;

&lt;p&gt;(It’s also possible that doing less testing &lt;em&gt;won’t&lt;/em&gt; result in more bugs; it is possible to over-test, after all. Pairs of counter-metrics aren’t guaranteed to be anti-correlated.)&lt;/p&gt;

&lt;p&gt;It’s not always trivial to have a counter-metric. Counter-metrics are themselves metrics that will have their own side effects. But for any metric you must ask: how will you know if it starts to do more harm than good?&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Make them temporary
&lt;/h2&gt;

&lt;p&gt;If you have a reason to measure something, there will be a reason to stop measuring it.&lt;/p&gt;

&lt;p&gt;Generally, a metric either serves as a way to observe the effects of a change, or as an early warning system.&lt;/p&gt;

&lt;p&gt;In the former case, once your goals have been met, think about getting rid of it. Make new goals and move on. Even if the goal hasn’t been met, examine why and re-evaluate. Avoid the temptation to make every metric a quota or target that has to be measured forever.&lt;/p&gt;

&lt;p&gt;Any metric related to testing usually falls into this category for me; nobody should care how many test cases ran, because at the end of the day what really matters is whether the product is able to do its job for its users. You might pay attention to counting test cases because you have a hypothesis that changing that number will improve the resulting product quality. (The “how” here matters in practice, but let’s assume for a moment that you have a legitimate reason to make the hypothesis.) There are three main possibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You succeed at increasing the number of test cases run before release, and product quality improves.&lt;/li&gt;
&lt;li&gt;You succeed at increasing the number of test cases run before release, but product quality doesn’t improve.&lt;/li&gt;
&lt;li&gt;You don’t succeed at increasing the number of test cases run before release.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all three cases, &lt;em&gt;you can stop worrying about how many test cases ran&lt;/em&gt;. The only case for keeping it around as a metric is the first scenario, so you can be alerted if the number regresses, but the longer you keep a metric like this around as a target, the more likely it is to be manipulated. The effects of Goodhart’s law are guaranteed to come into play eventually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus tip: Know who you’re talking to
&lt;/h2&gt;

&lt;p&gt;A lot of my motivation for wanting to write this is that I’m a very quantitative person by nature. I have a hard science background and I like understanding things in terms of numbers where it makes sense to do that. If you’ve ever done corporate training on communication styles, you’ve almost certainly seen a 2×2 matrix dividing us all into 4 types of people. One quarter is always some version of “logical” or “analytical”, which is where I fall. These courses teach you that, even if you’re not that kind of person yourself, you’ll have more success communicating with people like that if you can put numbers on things. If you talk to someone in the opposing quarter — usually a Hufflepuff — you should leave the numbers out. Who is looking at a metric can be just as important as the metric itself.&lt;/p&gt;

</description>
      <category>culture</category>
      <category>measurement</category>
      <category>cemkaner</category>
      <category>communication</category>
    </item>
    <item>
      <title>What should the ratio of automated to exploratory testing be?</title>
      <dc:creator>Gregory Paciga</dc:creator>
      <pubDate>Tue, 24 Aug 2021 14:00:00 +0000</pubDate>
      <link>https://forem.com/gpaciga/what-should-the-ratio-of-automated-to-exploratory-testing-be-55lc</link>
      <guid>https://forem.com/gpaciga/what-should-the-ratio-of-automated-to-exploratory-testing-be-55lc</guid>
      <description>&lt;p&gt;I popped into an online panel about testing today, and a question along these lines was asked: what is the ratio between automated testing and exploratory testing at your company?&lt;/p&gt;

&lt;p&gt;I get the gist of what is being asked here of course, and wouldn’t get too pedantic in answering it in the moment, but this is my blog and I can get pedantic if I want. The interesting thing to me about this question is that even confining it to a single company, single team, or single product, there are many different ways to answer it. The crucial point is that for you to measure a ratio, you need to attach the same units to both quantities.&lt;/p&gt;

&lt;p&gt;One of my pet peeves is that tests aren’t countable things. Even strictly scripted manual tests usually have relatively arbitrary boundaries. Automated tests, especially UI tests, can typically be refactored in ways that change the number your tool reports at the end while containing the same set of assertions. (Even if you’re strict about saying one test has one explicit assertion, there are implicit assertions everywhere. Some tests might have 0 explicit assertions!) Even if you could count tests, how do you count &lt;em&gt;exploratory&lt;/em&gt; tests in any comparable way? Each session? Each charter? Certainly not per assertion, because a human will make a hundred different “assertions” in their head every minute. No, counting tests can’t be done.&lt;/p&gt;

&lt;p&gt;Even asking about relative &lt;em&gt;coverage&lt;/em&gt; of the two approaches requires some underlying quantifiable thing to cover. I generally work in an environment where implicit requirements far outnumber anything explicitly written down. You may have some set of user stories, BDD scenarios, or other itemized list you can go through, but it’s unlikely many of them fall squarely into automated vs exploratory buckets. (I’ve rarely automated tests for features I didn’t also explore myself.) Can you even say a user story is 80% tested with automated tests and 20% with exploratory? 20% &lt;em&gt;of what&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Maybe you can try to use time spent. I should be able to roughly say how much time I spent exploratory testing, at least if I’m good about time boxing and not getting distracted too much. But how do you measure the time of automated testing? Should I count how long I spent writing test automation, or running it, or analyzing the results? Let’s say I typically spend 1 hour writing code and 2 hours doing an exploratory session, so the ratio would then be 1:2. Does that ratio tell me anything interesting?&lt;/p&gt;

&lt;p&gt;One thing it certainly &lt;em&gt;doesn’t&lt;/em&gt; tell me is how much of my testing work is done by automation, or the relative importance of those two activities. In that one hour I might have added a handful of scenarios to an enormous test suite that takes 8 hours to run. I don’t &lt;em&gt;personally&lt;/em&gt; spend 8 hours on running that automation; I can do other things while the computer does that work. In that sense, investing more time into automation can actually &lt;em&gt;decrease&lt;/em&gt; the automation-to-exploratory ratio. If the automated tests are already quite thorough and robust, those tests alone could be enough to release and the exploratory work is gravy. At the other extreme, the automated tests might still be too immature to tell me much useful yet, and the exploratory tests are the crucial part.&lt;/p&gt;

&lt;p&gt;You could, perhaps, get into questions of how much time automated testing saves you compared to doing all of that manually. (Let’s hope that you would never actually try such a thing.) But then we’re starting to talk more about automated vs &lt;em&gt;scripted manual&lt;/em&gt; testing, and about how you might measure the ROI of automation. Both are totally different beasts.&lt;/p&gt;

&lt;p&gt;In reality, I expect most people would answer the question with some ratio of “human time” spent, and I’m thinking about it too much. Nobody asked for units.&lt;/p&gt;

&lt;p&gt;The question is really asking about strategy: How do you balance these two activities to be most effective? There’s no one way to objectively measure that, as it depends on what you care about. Are you asking how much time should be spent on automation? Are you asking about relative importance? Are you asking about what &lt;em&gt;can&lt;/em&gt; be automated or what &lt;em&gt;should&lt;/em&gt; be explored? Are you just trying to plan how many people you need on a project? Your priorities will determine which way to quantify these things.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>generaltesting</category>
      <category>measurement</category>
      <category>questions</category>
    </item>
  </channel>
</rss>
