Latency, Throughput, and Variance

We visited Gothenburg, Sweden this year where I spoke at the Swedish Developers Conference. The hotel we occupied had an elaborate double revolving door in the lobby, necessary to deal with the Swedish winter. Watching people interact with the door got me thinking about software development (pretty much anything can get me thinking about software development). Individual people would come up to the door and pass right through. Groups would bunch up at the door, but eventually they would all get through. If too many people queued up, they would start using the regular swinging doors to either side of the revolving door and let the warm air out.
The next week I was giving a workshop at SKB Kontur in Ekaterinburg, Russia. I needed to demonstrate systems thinking, so I pulled out the revolving door example. As I started analyzing the situation I first made the distinction between throughput–the number of people going through the door per unit time–and latency–the average time required between when an individual walked up and when they got through the door. The third variable that came up was variance–the distribution of latency and throughput. The explanation seemed to help, so I thought I would repeat it here.
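To make the three variables concrete, here is a minimal sketch that computes them from door observations. The `door_metrics` function and the timestamps are made up for illustration, and it simplifies by measuring variance on latency only.

```python
from statistics import mean, pvariance

def door_metrics(passages):
    """passages: list of (arrival_time, exit_time) pairs, in seconds."""
    latencies = [exit_t - arrive_t for arrive_t, exit_t in passages]
    # Observation window: first arrival to last exit.
    window = max(e for _, e in passages) - min(a for a, _ in passages)
    return {
        "throughput": len(passages) / window,  # people per second
        "latency": mean(latencies),            # average seconds per person
        "variance": pvariance(latencies),      # spread of the latencies
    }

# Hypothetical data: a group of three bunches up, then one person walks through.
passages = [(0, 4), (1, 7), (2, 10), (20, 24)]
metrics = door_metrics(passages)
print(metrics)
```

The group of three shows up as rising latencies (4, 6, and 8 seconds), which is where the variance comes from. The lone visitor at the end passes through in 4 seconds.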
Three Variables
There you have it, the three process variables–throughput, latency, and variance. These three process variables complement the “Iron Tetrahedron” of product variables–scope, quality, cost, and time. They are a way of looking at how the slices of the product are delivered. They constrain and affect each other, sometimes reinforcing, sometimes inhibiting, and sometimes working in different directions at different time scales.
The importance of the three process variables was brought home to me this week when a friend told me about a dilemma at work. Colleagues had explained that they could deliver the code or the tests but not both. They just didn’t have time. My friend’s perspective was that tests help him deliver faster. When others code without automated tests, he ends up paying the price. As I see it, he and his colleagues have different views on the importance of the variables, their relationship, and the time scales about which they need to be concerned. Not that this resolved the conflict, but it’s useful to see those you are in conflict with as having a different perspective, not just being stupid and recalcitrant.
The Latency Trap
My friend’s colleagues had fallen for the Latency Trap (or at least didn’t care about its long-term effects). In the Latency Trap, the developer thinks, “I need to get this done quickly, so I’ll cut back on testing and design.” Over time, this creates unpredictable problems (increases the variance). If the developer loses the bet, this variance lengthens the time required to complete a task so much that he would have been better off to invest in testing and design. Here is a diagram of effects for this dynamic:

Lower latency leads to higher variance leads to higher latency
Reducing the latency increases the variance which increases the latency. The harder you push on a negative feedback loop like this, the harder it pushes back. Trying to reduce the latency further (after all, you’re late after the first time through the cycle) increases the latency further.
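The bet behind the Latency Trap can be sketched as a small expected-value calculation. Everything here is an assumption for illustration: the `bet` function, the day counts, and the model of a single "unpredictable problem" that strikes with some probability and costs a fixed amount of rework.

```python
def bet(base_days, p_problem, rework_days):
    """Return (expected days, variance in days^2) for one task.

    Assumes one possible problem that occurs with probability
    p_problem and costs rework_days when it does.
    """
    expected = base_days + p_problem * rework_days
    variance = p_problem * (1 - p_problem) * rework_days ** 2
    return expected, variance

# Hypothetical numbers: skipping tests and design saves two days up front
# but makes a costly surprise much more likely.
rushed = bet(base_days=3, p_problem=0.5, rework_days=8)
careful = bet(base_days=5, p_problem=0.125, rework_days=8)
print("rushed: ", rushed)
print("careful:", careful)
```

With these made-up numbers the rushed task looks faster on paper (3 days versus 5) but is slower on average and far less predictable. With a low enough chance of trouble the bet would pay off, which is why it is tempting.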
There are times when you can avoid the jackpot at the end of the Latency Trap. Reducing latency becomes a valuable tool in the earliest stages of a startup. If 80% of your ideas are going to die as soon as you discover nobody will pay for them, the “later” of “pay me later” never comes. You can afford to revisit the features you rushed through the first time, or rather if you hadn’t rushed through all those features in the first place you couldn’t afford to revisit anything because you’d be out of business. Noticing when this survival dynamic no longer holds and shifting to a more sustainable model seems to be a challenge. Ideally you’d be able to use the model business conditions demand.
The Throughput Trap
Another dysfunction comes when a developer thoughtlessly tries to do more work in the same amount of time. “Work smarter, not harder,” often means, “work faster, not slower.” Duh. Again, scrimping on testing, design, and automation results in higher variance, and higher variance destroys throughput.

Higher throughput leads to higher variance leads to lower throughput
Again the negative feedback loop, again the unintended consequence. Ouch.
The Variance Trap
If variance is the key, why not drive it to zero? That should improve throughput, anyway. That’s the approach taken by the style of software development relying on one-shot speculation for planning. Answer all the hard problems early and all that’s left are the easy problems at the end. The problem is that variance is a balloon. If you squeeze one end of the balloon, the variance will pop out at the other. Latency suffers when you have a big up-front planning process. More importantly, though, there are significant sources of variance that are only revealed with time and experience. A process that focuses on eliminating variance isn’t prepared to deal with variance that drops in for a visit unannounced.

The picture above is wrong. Lower variance leads to *higher* latency leads to higher variance
Trying to reduce variance increases latency which (by reducing feedback) increases variance. Another negative feedback loop. Is software development doomed?
The System
I don’t think so. The problem comes from trying to improve one of the process variables in isolation. The three form a system, with feedback loops that can be arranged and driven to positive or negative ends. Software development can have high throughput, low latency, and low variance, but only by paying attention to the system the variables form.
For example, you can reduce average latency by delivering a slice of functionality to production weekly. The size of the slice is no more than can be carefully designed, tested, and deployed. This gets some features out soon, some out even at the end of the first week. The feedback from those features goes into reducing variance, to stomping out surprises when they are little and fit under your feet.
Delivering weekly slices cuts into throughput, no question about that. The time it takes to deploy a feature could be spent implementing the next feature. However, over time a transformation takes place. The reduction in variance means that the team isn’t interrupted so often. Their investment in design and testing starts to pay off by reducing the amount of work required for future features, and throughput increases.

Reducing latency results in higher throughput
While the short term effect on throughput is negative, the system as a whole delivers better throughput through reduced latency.
You can drive a similar virtuous cycle by increasing throughput. Automation is one consistent habit of effective teams. Automating a task increases throughput long-term at a short-term cost in latency (doing a task by hand is almost always quicker than automating the same task). Over time, though, automation reduces latency by executing tasks faster than they would have been done manually. An automated task eliminates variance, which further increases throughput and reduces latency.

Increasing throughput results in reduced latency
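The automation trade-off comes down to a break-even count: how many times a task must run before the up-front cost of automating it is repaid. This is a rough sketch under assumed numbers. `break_even_runs` is a hypothetical helper, and it ignores maintenance of the automation and the variance it removes.

```python
import math

def break_even_runs(automation_cost, manual_cost, automated_cost):
    """Smallest number of runs at which automating costs no more than
    doing the task by hand. All costs are in the same unit (minutes here).
    Returns None if the automated run is not cheaper than the manual one."""
    saving_per_run = manual_cost - automated_cost
    if saving_per_run <= 0:
        return None
    return math.ceil(automation_cost / saving_per_run)

# Hypothetical task: 10 minutes by hand, 2 minutes once automated,
# 120 minutes to write the automation.
runs = break_even_runs(automation_cost=120, manual_cost=10, automated_cost=2)
print(runs)
```

Before the break-even point, automation shows up as added latency. After it, every run adds to throughput.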
Responsible Development
The principles and practices of XP/Responsible Development suggest ways to drive the loops between the process variables in positive directions. Continuous integration, continuous testing, and continuous deployment all aim to simultaneously improve latency, throughput, and variance over the medium to long term. The zero defect policy is intended to encourage early learning, thereby reducing variance and improving throughput and thus latency.
The process variables and their interactions are facts of life as a software developer. What is fully under our control is how we respond. By paying attention to them as a system we can get more features through the doors more predictably in less time.
Nice example, but I’d argue that the “latency” and “throughput” in your example are perceived rather than actual – developers may think they’ve finished a particular task, but if they haven’t done it properly they haven’t really finished it after all. Like when you go through the revolving door but you’ve left one of your bags outside on the pavement and have to go back outside again.
I often draw system diagrams where I show how “perceived” and “actual” get out of synch, and how you could bring them back together. But I admit – these diagrams can get more complex than yours.
EXCELLENT article. I’m sharing it with the entire product development team at my company!
We’re very heavily focused on increasing throughput without considering variance or latency.
Interesting thoughts. Kanban advocates like David Anderson and Karl Scotland think about this stuff a lot, and would argue that reducing variance first is the key – in the end, the main thing the customer / business wants is *predictability*, and achieving that predictability gives the team the peace and stability they need to gradually pick up their pace.
By measuring cycle time (latency) and variance of that cycle time and aiming to reduce both while imposing a tight limit on the amount of work in progress, I think Kanban adds some really useful tools to the ones in the XP box.