To Fix Or Not To Fix?: Another Good Question

(This post is the second in a trilogy that started with To Test Or Not To Test?)

One of the cornerstones of Extreme Programming is Defects Zero (like “inbox zero”, but for defects). A few years after XP started becoming popular, I started hearing stories about teams that went months between defects or even had no defects at all. Such reliability is the result of a lot of hard work, but it is achievable, and at negative cost (that is, it costs less over the long run to develop under Defects Zero).

An important step on the road to Defects Zero is developing an intolerance towards open defects. Periodically someone comes onto the XP mailing list and asks, “We are starting our first XP project. What defect tracker should we use?” No one seems to like the answer, “Don’t have any defects to track,” but that really is how some teams work.

Working under Defects Zero feels great. Nothing hangs over your head. If you are adding a feature, there’s no defect you “ought” to be fixing instead. Interruptions fall dramatically.

But on the runway…

Defects Zero is a strategy for maximizing throughput. When a defect is reported, the team tests for it, fixes it, performs a root cause analysis, and remediates the source of the defect. While this is an immediate investment, it pays off over time. The team goes faster when fewer defects are being reported in the first place and when they can trust in their tests to tell them if the system is healthy.

One of the adjustments I had to make when I dove back into startups was realizing that Defects Zero, with its focus on throughput, doesn’t make sense for projects on the runway, when latency is key. On the runway, what matters is how many questions you can ask and answer, how many assumptions you can validate per unit of time (and money).

I’ll provide an example from my latest project, tentatively named “Tattlebird”. The assumption was that interesting stuff was happening in all those browsers out there, stuff that developers would like to know about. What was the quickest way to validate this assumption?

Because of browser differences, writing a cross-browser data gatherer is a lot of work. Fortunately, the data gatherer is simple in Firefox. In half a day I had a data gatherer that worked only in Firefox. That means it would fail more than half the time (update: it silently doesn’t send any data). I deployed it. Nothing happened for the first five minutes so I took a nap. An hour and a half later there was a river of information of exactly the type I hoped would be there. Assumption validated.
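
To make the trade-off concrete, here is a minimal sketch of what a Firefox-only gatherer of this kind might look like. It is hypothetical: the post doesn’t say what Tattlebird collected or where it sent the data, so the endpoint, the payload shape, and the choice of uncaught script errors as the “interesting stuff” are all invented for illustration.

```typescript
// Hypothetical sketch of a quick, Firefox-only gatherer. The endpoint,
// payload shape, and the focus on uncaught errors are assumptions.
const ENDPOINT = "https://example.com/gather"; // invented collection URL

function isFirefox(): boolean {
  // Crude user-agent sniff: fine for a throwaway experiment on the runway,
  // not something to keep once reliability starts to matter.
  return navigator.userAgent.includes("Firefox");
}

if (isFirefox()) {
  window.addEventListener("error", (event) => {
    const payload = JSON.stringify({
      message: event.message,
      source: event.filename,
      line: event.lineno,
      page: location.href,
      when: Date.now(),
    });
    // Fire-and-forget: if the send fails, nothing complains, which matches
    // the "silently doesn't send any data" behavior described above.
    navigator.sendBeacon(ENDPOINT, payload);
  });
}
// In any other browser nothing is registered, so the host page keeps
// working and the gatherer simply stays quiet.
```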

In a Defects Zero world, I would next make the gatherer work for all the other browsers. I don’t want to have to worry about someone calling and saying, “Your system doesn’t work with my (obscure) browser.” However, I have a more valuable use for my time: validating the next assumption, that developers will find this interesting information concretely valuable. How am I going to test that? The current data stream is more than sufficient to answer that question.

I don’t like working with defects piling up, but the economics of my current situation are that for the moment validated assumptions are worth a lot and reliable software is only worth a little. If I could find a win-win solution, like using a solid cross-browser library, I would happily do so. However, for the moment latency holds the reins.

Managing a Defect Backlog

I used to know how to manage a defect backlog: severity, frequency, etc. Turns out I’ve forgotten. I vividly remember a panicked phone call with Eric Ries where he talked me off the figurative ledge. Here’s what worked for me.

The key is treating defects as simply information, not automatic triggers for action. For project management purposes, ignore individual defect reports and only deal in aggregates. Use a simple classification scheme, for example by scanning defect reports with regular expressions. Looking at “48 A, 23 B, 17 C, 24 others” once a day is much less likely to induce panic than having the defect siren go off every ten minutes around the clock. In the cruise phase each defect is accompanied by the sound of evaporating money. On the runway defects are part of the price of feedback.
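
As a sketch of what that could look like in practice (the categories, the regular expressions, and the report format are invented for illustration, not taken from the post):

```typescript
// Hypothetical sketch: bucket raw defect reports with regular expressions
// and look only at the daily aggregate counts. Categories are invented.
const CATEGORIES: Array<[string, RegExp]> = [
  ["A: script error", /uncaught|TypeError|ReferenceError/i],
  ["B: network",      /timeout|connection reset|502|504/i],
  ["C: bad input",    /validation|parse error/i],
];

function classify(report: string): string {
  for (const [name, pattern] of CATEGORIES) {
    if (pattern.test(report)) return name;
  }
  return "other";
}

function dailySummary(reports: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const report of reports) {
    const category = classify(report);
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  return counts; // e.g. { "A: script error" => 48, "B: network" => 23, ... }
}
```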

In this scheme it is changes that are worth noticing immediately. A never-before-seen defect is worth triaging; a large change in the proportions or the absolute numbers of defects likewise. As long as defects don’t interfere with feedback from users, though, more attention than this is counterproductive.
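
Continuing the same hypothetical sketch, the daily look then reduces to asking which categories changed enough to be worth a glance (the 15% threshold is arbitrary, chosen only to illustrate the idea):

```typescript
// Hypothetical sketch: flag only the changes worth triaging, namely categories
// never seen before, or whose share of the total shifted more than a threshold.
function worthTriaging(
  yesterday: Map<string, number>,
  today: Map<string, number>,
  threshold = 0.15,
): string[] {
  const flagged: string[] = [];
  const totalYesterday = [...yesterday.values()].reduce((a, b) => a + b, 0) || 1;
  const totalToday = [...today.values()].reduce((a, b) => a + b, 0) || 1;

  for (const [category, count] of today) {
    if (!yesterday.has(category)) {
      flagged.push(`${category}: never seen before`);
    } else {
      const before = (yesterday.get(category) ?? 0) / totalYesterday;
      const after = count / totalToday;
      if (Math.abs(after - before) > threshold) {
        flagged.push(`${category}: share moved from ${before.toFixed(2)} to ${after.toFixed(2)}`);
      }
    }
  }
  return flagged; // everything else can wait for the regular rhythm of the project
}
```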

It is the users, both their character and their number, that make this system work. Early customers are, well, early. They are willing to endure a little pain in order to use the latest cool system, or to solve a previously unsolvable problem, or just to have their voice heard. In addition, there just aren’t that many of them. As a recent post from 37 Signals said, after you’ve scaled there are no corner cases. You can walk one or two people through a workaround, but you’re better off fixing a defect, even an expensive one, rather than trying to walk a thousand people through the same process. On the runway, users with defects need to be heard, not necessarily served.

Conclusion

When I noted that tests needed to be used thoughtfully on the runway I was accused of abandoning my principles, of having no pride, of not being a craftsman. None of these is true, not of testing, not of defects, and not of (coming next) design. The higher principle I follow is to create as much value with my skills and talent as I can. In an environment of great uncertainty and minimal resources, when latency is key and the half-life of code is short, that means testing less and carrying defects. When the code’s half-life stretches out and throughput becomes the business driver, Defects Zero makes perfect sense. The transition between the two styles, especially coming as it does during the stress of the climb phase, is a challenge. Being able to develop with a defect backlog, being able to maintain Defects Zero, and being able to move from the former to the latter are all valuable skills for a complete developer.

Comments

Sebastian Kübeck, August 6th, 2009 at 5:04 am

Great post! You could argue in the same direction with lean: Defects are waste. They don’t add any value to the customer and fixing them is expensive. Conclusion: get rid of them.

One question: suppose there is a team in transition to XP with a product in the cruise phase. They are working on a legacy system without tests and with loads of bugs and feature requests. What options do they have to achieve Defects Zero?

Bob MacNeal, August 6th, 2009 at 10:13 am

Given that programmers have finite capacity, often the choice is simple:
implement new features OR fix bugs (…or build a framework that converges toward Defects Zero).

If you’re piloting this startup down the runway, you get to decide the value of those choices.

Once you have a v1 product, do you agree with Anna Forss in a recent blog post where she says “Bugs create more detractors than features create promoters”?

Kent Beck, August 6th, 2009 at 11:04 am

Bob,

I think the cost of defects depends on the number and attitude of the people affected by them. If a few knowingly bleeding-edge users encounter a defect, it doesn’t cost the company in the long run as long as those users can still get their needs met soon. If someone is paying simply to use the service and they encounter a defect, it is much more expensive. As I said at the end of the piece, the transition from managing a defect backlog to Defects Zero is hard but necessary.

Tathagata Chakraborty, August 7th, 2009 at 6:34 am

A slightly different perspective (and a naive question): if the number of defects starts increasing (or rather, if the rate of defects increases), doesn’t that indicate that you are going too fast even while you are on the runway, and that you might never be able to come back to Defects Zero? Does it make sense to track your defect rate?

Herb Lainchbury, August 20th, 2009 at 10:11 am

Great post. I also find it useful to distinguish between my project’s components. I separate them into two groups: common (i.e. framework code and framework enhancements) and experimental (i.e. custom code for the current experiment). Defects in common code are worth fixing because I use that code in many experiments, so fixing them improves all of my experiments. Fixes to defects found in experimental code are optional.
