Large numbers of failed tests are rare
What is the lifetime cost of developer tests? Do they smooth development into the future or are they like mud, clinging ever thicker to your boots until you can’t move at all?
I used a data set from DevCreek to explore this question. DevCreek is a site that helps programmers gather statistics on their development activities. The data in this set comes from the programmers at Intelliware, a Toronto software consultancy and the developer of DevCreek. Their data in DevCreek covers almost 100 person-years of development.
DevCreek gathers data on testing by recording the results of every test run during coding. There are nearly 300,000 test runs in the data set. Mining this data provides clues about the cost of tests.
Data
Tests that pass incur only the cost of waiting for the test to finish (this can add up, but that’s the topic of another post). It’s the tests that fail that require programmer time and attention. So, how many tests fail on average every time a programmer presses “Run”? Actually, that’s a trick question. A large majority (85%) of test runs have either zero or one failure. Plotted on a log-log scale, the number of failures per run forms a fairly straight line, characteristic of a power law distribution:
The data show that occasionally large numbers of tests do fail (5049 was the most tests that failed in a single run). This doesn’t necessarily mean that such jackpot test runs are expensive to fix. There could be a single cause for all the failures. More analysis of the data is necessary to see what happens downstream of large numbers of test failures.
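For readers who want to try this on their own test logs, here is a minimal sketch of the tally and the log-log slope. It is not the analysis code behind this post. The function names and the sample numbers are made up for illustration, and the slope comes from a plain least-squares fit, which is only a rough estimator for power law data.

```python
import math
from collections import Counter


def share_with_at_most(histogram, limit):
    """Fraction of runs whose failure count is at most `limit`."""
    total = sum(histogram.values())
    return sum(n for k, n in histogram.items() if k <= limit) / total


def loglog_slope(histogram):
    """Least-squares slope of log(run count) against log(failure count).

    Runs with zero failures are skipped because log(0) is undefined.
    """
    points = [(math.log(k), math.log(n)) for k, n in histogram.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct nonzero failure counts")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical input: one failure count per test run (not the DevCreek data).
failures_per_run = [0, 0, 1, 0, 2, 1, 0, 0, 1, 4, 0, 1, 2, 0, 1, 1]
histogram = Counter(failures_per_run)

print("share of runs with 0 or 1 failures:", share_with_at_most(histogram, 1))
print("log-log slope:", loglog_slope(histogram))
```

A slope that stays roughly constant across the whole range of failure counts is what a straight line on the log-log plot corresponds to.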
Perhaps the number of test failures is limited because people aren’t running many tests. In that case the percentage of test failures should be relatively constant. Here is the graph of test failure percentages:

The big spike at 100% failures is mostly (55000 out of 63000) caused by test runs with a single test that fails (common in test-driven development). Even the little blip at 50% is mostly caused by test runs with two tests, one of which fails. Excluding test runs that pass, the percentage of test failures is evenly distributed.
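The percentage view can be reproduced with a few lines. This is a sketch under assumptions: each run is a hypothetical (tests run, tests failed) pair, passing runs are excluded as in the paragraph above, and the bin width of 10 points is an arbitrary choice.

```python
from collections import Counter


def failure_percentage_bins(runs, bin_width=10):
    """Count failing runs by the percentage of their tests that failed.

    Each bin is labelled by its lower edge; runs where every test failed
    land in the bin labelled 100.
    """
    bins = Counter()
    for tests, failed in runs:
        if tests <= 0 or failed <= 0:
            continue  # skip empty runs and runs that passed
        percent = (100 * failed) // tests
        bins[(percent // bin_width) * bin_width] += 1
    return bins


def single_test_share_of_total_failures(runs):
    """Among runs where every test failed, the fraction that ran exactly one test."""
    all_failed = [tests for tests, failed in runs if tests > 0 and failed == tests]
    if not all_failed:
        return 0.0
    return sum(1 for tests in all_failed if tests == 1) / len(all_failed)


# Made-up runs: (tests run, tests failed)
runs = [(1, 1), (1, 1), (2, 1), (10, 10), (10, 0), (3, 1)]

print(sorted(failure_percentage_bins(runs).items()))
print(single_test_share_of_total_failures(runs))
```

On the data described above, the second figure corresponds to 55000 out of 63000, or about 87%.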
Conclusion
Snow avalanche magnitude follows a power law distribution. When planning for avalanches, the important question is, “What is the worst avalanche we’re likely to see in the next 100 years?” The average avalanche magnitude is not helpful in answering this question. Instead, you need to know the slope of the line in the log-log graph. Following the slope out to the end gives you a clue about the maximum magnitude. As reported in the referenced paper, the slope varies by location from 2 to 5. Data sets from each location could have the same average magnitude, but the likely maximum magnitude would be dramatically different.
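A little arithmetic shows how much the slope matters. The sketch below assumes the slope is the exponent a in "frequency is proportional to magnitude to the power of minus a". That reading is an assumption of this example, not a detail taken from the paper, and the function name is made up.

```python
def relative_frequency(size_ratio, slope):
    """How often an event `size_ratio` times larger occurs, relative to the
    smaller event, if frequency falls off as size ** -slope."""
    return size_ratio ** (-slope)


for slope in (2, 5):
    print(f"slope {slope}: a 10x larger event is "
          f"{relative_frequency(10, slope):g} times as frequent")
```

Under that assumption, an event ten times larger is a hundred times rarer at a slope of 2 and a hundred thousand times rarer at a slope of 5.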
Similarly, when speaking of test failure rates, the average number of failed tests is not a helpful measure. Averages are misleading in power law distributed data because the outliers have a disproportionate impact on the result. To characterize the data and draw conclusions about the future you need to begin with the slope. This helps predict how frequent large numbers of failures will be and what is meant by “large”. Tracking changes to the slope over time could give you clues to the health of the system.
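To make the point about averages concrete, here is a small sketch with made-up numbers. Only the 5049 figure comes from the data set (the largest number of failures in a single run). The other 99 values are invented for illustration.

```python
from statistics import mean, median


def summarize(values):
    """Return (mean, median) for a list of failure counts."""
    return mean(values), median(values)


# Hypothetical failing runs: mostly one failure, plus a single jackpot run.
failing_runs = [1] * 85 + [2] * 10 + [10] * 4 + [5049]

with_outlier = summarize(failing_runs)
without_outlier = summarize(failing_runs[:-1])

print("with the jackpot run:    mean %.2f, median %s" % with_outlier)
print("without the jackpot run: mean %.2f, median %s" % without_outlier)
```

One run out of a hundred moves the mean from about 1.5 to about 52 while the median stays at 1, which is why the mean says little about a typical run.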
The original question remains unanswered so far. The number of failures is a crude approximation of the overall cost of keeping tests passing. The mean time between failures (MTBF) of tests and the time required to get a test passing once it has failed would be useful measures to get closer to a fact-based estimate of cost.
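Those two measures could be computed along the following lines. This is a sketch only: it assumes a hypothetical history for a single test, given as time-ordered (timestamp in seconds, passed) pairs, and it treats a test as passing before its first recorded run. Neither assumption comes from the DevCreek data.

```python
from statistics import mean


def failure_gaps_and_repair_times(history):
    """Return (gaps between failure onsets, times from onset to next pass).

    `history` is a time-ordered list of (timestamp, passed) pairs for one test.
    A failure onset is a failing run that follows a passing run.
    """
    onsets = []
    repairs = []
    failing_since = None
    previous_passed = True  # assumption: the test passed before the log starts
    for timestamp, passed in history:
        if previous_passed and not passed:
            failing_since = timestamp
            onsets.append(timestamp)
        elif not previous_passed and passed:
            repairs.append(timestamp - failing_since)
            failing_since = None
        previous_passed = passed
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return gaps, repairs


# Made-up history for one test.
history = [(0, True), (10, False), (15, False), (30, True), (100, False), (110, True)]
gaps, repairs = failure_gaps_and_repair_times(history)

print("mean time between failures:", mean(gaps))
print("mean time to get passing again:", mean(repairs))
```

A test that is still failing at the end of the history contributes an onset but no repair time, so open failures are not counted as repaired.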
A potential limitation of this analysis is the narrow data set: all of it comes from one company. The programmers at Intelliware have been practicing developer testing for many years. To draw broad conclusions, data sets from a variety of development contexts, programming languages, and problem domains need to be analyzed.
Appendix: Test Count versus Test Failures
Another way to look at the relationship of total test count in a run to the number of failures is to plot them independently. One pillar of the “tests are costly” argument seems to be that the cost of tests (approximated here by the failure rate) is constant per test: more tests, more cost. Here is a 3D visualization of tests per run versus failures per run:

Most of the time test runs succeed, or a few tests fail, regardless of the size of the test suite. The only exception is for runs with a single test, where two-thirds of the time the sole test fails. This is characteristic of test-driven development (in my own practice I usually run more than one test at a time). In any case, more tests do not mean more failures.
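The same comparison can be made without a 3D plot by bucketing runs by suite size. The sketch below is illustrative only: the bucket edges and the sample runs are made up, and each run is a hypothetical (tests run, tests failed) pair.

```python
import bisect
from collections import defaultdict


def failing_share_by_suite_size(runs, size_edges=(1, 2, 10, 100, 1000)):
    """For each suite-size bucket, the fraction of runs with at least one failure.

    Buckets are labelled by their lower edge, so with the default edges
    bucket 1 holds single-test runs and bucket 2 holds runs of 2 to 9 tests.
    """
    totals = defaultdict(int)
    failing = defaultdict(int)
    for tests, failed in runs:
        if tests < size_edges[0]:
            continue  # ignore empty runs
        bucket = size_edges[bisect.bisect_right(size_edges, tests) - 1]
        totals[bucket] += 1
        if failed > 0:
            failing[bucket] += 1
    return {bucket: failing[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up runs: (tests run, tests failed)
runs = [(1, 1), (1, 1), (1, 0), (5, 0), (5, 1), (50, 0), (50, 0),
        (500, 0), (500, 3), (500, 0), (500, 0)]

for bucket, share in failing_share_by_suite_size(runs).items():
    print(f"suites of {bucket}+ tests: {share:.0%} of runs had a failure")
```

If more tests really meant more failures, the share would climb steadily from the small buckets to the large ones.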

Interesting data. Way back when, I wrote a paper with the premise that a test delivers value only when it finds a bug. Its cost was the cost to create it + the cost to rewrite it when the behavior it depends on legitimately changes. Back then, the time to write or rewrite a test (especially a through-the-GUI whole-product test) was high, the chance of the test breaking spuriously (not because it found a bug) was also high, and the chance of a test producing value was low. In particular, the chance of a test finding a bug *after* the first time it was run was low. The tradeoffs tended to push whole-product testing towards manual, one-shot, never-to-be-repeated tests.
Is there any way, given the data, to separate out tests that failed because an API changed from tests that failed because they found a bug?
(Since I wrote the paper, we’ve discovered other sources of value from tests. When it comes to unit tests, we’ve drastically reduced both the cost of writing them and the chance of them spuriously breaking. I still have my doubts about acceptance tests.)
Paper: http://www.exampler.com/testing-com/writings/automate.pdf
Fantastic post, Kent.
It makes me want to create plugins for common CI systems to anonymously upload stats. I imagine a lot could be learned from mining data. There are a lot of things the Agile community believes about software projects that would be great to prove or disprove.
Cascading failures or the perfect bug storm come to mind when looking at the graph. I once caused a massive failure of the Apache software test suites by removing a deprecated method from a popular unit testing framework… there are valuable lessons to be learned during such extremes, assuming survival.
A thoughtful analysis. I wonder what the data says about how the mean-time-to-repair varies by number of tests or the number of failures. That might get at the burden imposed by poorly-written tests, though distinguishing test issues from production code issues is probably beyond the scope of the data you have. Still, it might make for an interesting further analysis.
I also wonder how many outliers there are that can be explained away as “leaving a failing test” to provide focus the next morning.
A More Scientific Approach to the Costs of Testing…
Sometimes people level the “tests are too expensive” argument. Kent Beck (Mr. Extreme Programming) provides a refreshingly scientific journey into why the lifetime cost of tests is small. One main point is that most failed test runs are caused by……
I’m most curious about the transitions. I’m sure 0->1 and 1->0 are high (given their dominance) but I’m also curious what others show up (and especially what happens to the high-failure-count ones – do they get fixed right away?).
It also might be interesting to look at “chains” of failures – MTTR (‘mean tests to repair’ to steal an ETLA:) – how long are the chains from 0 back to 0?