Amazon: a case study on combatting bureaucratic drag

through separable, autonomous and focused sub-systems

Nov 21, 2022

In the past few essays, I have described how all codified, legible systems become more complex, kludge-ridden, dysfunctional and immobile over time. Although I have focused on government organisations such as the NHS so far, private enterprises suffer from the same problems.

In Chapter 3 of their book ‘Working Backwards’, Colin Bryar and Bill Carr describe the evolution of Amazon’s processes and its attempts to fight the problem of increasing “bureaucratic drag”. From the mid-1990s until the early 2000s, Amazon grew at an explosive pace and transformed from a small book retailer into a large multinational organisation. Any organisation that grows in this manner becomes less and less agile due to the “layer upon layer of permission, ownership, and accountability” that builds up over time. Amazon faced the same challenges, and soon, they were “spending more time coordinating and less time building”.

Buildup of Dependencies

As Amazon grew, more features were added to the platform and

more features meant more software, written and supported by more software engineers, so both the code base and the technical staff grew continuously. Software engineers were once free to modify any section of the entire code base to independently develop, test, and immediately deploy any new features to the website. But as the number of software engineers grew, their work overlapped and intertwined until it was often difficult for teams to complete their work independently.

This meant that any new project had to deal with an increasing number of dependencies, a dependency being “something one team needs but can’t supply for itself”. Most of the dependencies were a function of the monolithic structure of the codebase. For example, any change to the database needed to be approved by a steering group, which is understandable and unavoidable given the system-critical nature of the database. However, Amazon also suffered from equally debilitating organisational dependencies.

Our organizational chart created extra work in a similar fashion, forcing teams to slog through layers of people to secure project approval, prioritization, and allocation of shared resources that were required to deliver a project. These organizational dependencies were just as debilitating as the technical ones.
Just like our software, many of our org structures had become tightly coupled and were holding us back.

It takes very little to produce crippling organisational dependencies over time. Increased specialisation and the creation of organisational units to deliver separate tasks are sufficient, even in the absence of a significant burden of code and software.

Predictably, this buildup of dependencies meant any new project faced multiple bottlenecks that were outside the control of the team responsible for it, and this led to disempowerment and frustration at the slow pace of change and innovation. To Amazon’s credit, they identified the problem and tried multiple solutions to tackle it. They also figured out along the way what does not work.

Better coordination is not the answer

The more dependencies between teams, the more coordination is needed. However, as Jeff Bezos correctly identified,

all this cross-team communication didn’t really need refinement at all—it needed elimination
Jeff’s vision was that we needed to focus on loosely coupled interaction via machines through well-defined APIs rather than via humans through emails and meetings. This would free each team to act autonomously and move faster.

The aim is to “coordinate less and build more”. This is a vitally important principle that most organisations fail to understand. When implementing a new project that requires expertise across domains and organisational silos, the first instinct is always to set up inter-departmental teams, more periodic meetings etc. However, all that better coordination achieves is more bureaucracy and less work done.

The Soviet Solution: Centralised Planning

Amazon’s first solution to the dependency problem was a process called New Project Initiatives (NPI), which tried to globally prioritise and rank all projects and resource allocation to these projects.

Here’s how NPI worked: Once every quarter, teams submitted projects they thought were worth doing that would require resources from outside their own team—which basically meant almost every project of reasonable size. It took quite a bit of work to prepare and submit an NPI request. You needed a “one-pager”; a written summary of the idea; an initial rough estimate of which teams would be impacted; a consumer adoption model, if applicable; a P&L; and an explanation of why it was strategically important for Amazon to embark on the initiative immediately. Just proposing the idea represented a resource-intensive undertaking.

The NPI process was essentially the Soviet approach to centralised planning. Most people’s idea of the Soviet planning system is one where central Soviet ministries simply hand down unrealistic targets and projects to firms and managers. This may have been the case during the early Stalinist days, but most post-WW2 Soviet planning was a two-way process, just like the NPI process was. For example, the process of determining the price of a new product started with a proposal presented by the enterprise to its ministry and the relevant committees.

This process of centralised control has many flaws, not least because it relies on the centralised approvers’ ability to predict which of the submitted projects will succeed. Moreover, over time, NPI requests will be tailored more to maximise the probability of approval than to maximise the probability of success, just as Soviet managers were eventually concerned only with satisfying their ministry and the control process rather than with the true performance of their enterprises or the needs of their customers.

Better people and more effort can make things worse

We often assume that bureaucracies and organisations function poorly because the people within them are incompetent or lazy. In fact, precisely the opposite is true. Once ‘dependency hell’ has set in, better people working harder not only fails to solve the problem, it worsens it.

It’s not that the participants in the NPI arena—or the DB Cabal for that matter—fell short or had nefarious motives. They were all top-notch, talented, hard-working people who were swimming against a riptide of dependencies. If you’re faced with a challenge that’s growing exponentially, meeting it head-on with equal but opposing force just locks you into exponentially growing cost—a dead-end strategy.

In fact, the system may only function if rules are ignored, and the people within the system are ‘careless’.

Autonomy without power is not enough

Amazon’s next solution to the problem was to set up “Two-Pizza teams”, teams of no more than ten people that are autonomous and evaluated by a pre-defined “fitness function”, the function being a “sum of a weighted series of metrics”.

This solution ran into two problems. First, it is easy to want to be autonomous, but how can any team be autonomous if it has to deal with multiple organisational and software dependencies without which it cannot achieve its goal? And second, the introduction of the fitness function only served to increase the complexity of the process.

Replace the existing system piece by piece

The first step was, therefore, to replace the monolithic software architecture with a service-oriented architecture with multiple isolated services where external access to each service was only possible via a well-documented API. The monolith was replaced piece-by-piece whilst ensuring that it “continued to stand until its last surviving function had been replaced by a service”.
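A minimal sketch of what this piece-by-piece replacement can look like in code. It is only an illustration, not Amazon's implementation: the `Facade` class, the function names (`checkout`, `search`) and the handlers are all hypothetical. Callers go through a single routing layer, each function is served by its new service once one has been registered, and everything else still falls through to the monolith.

```python
from typing import Callable, Dict, Iterable

# Hypothetical handler type: takes a request dict, returns a response dict.
Handler = Callable[[dict], dict]


def monolith_handle(function: str, request: dict) -> dict:
    """Stand-in for the legacy monolith, which can still serve every function."""
    return {"served_by": "monolith", "function": function}


class Facade:
    """Routes each function to its new service if one exists, else to the monolith."""

    def __init__(self, legacy: Callable[[str, dict], dict]) -> None:
        self._legacy = legacy
        self._services: Dict[str, Handler] = {}

    def migrate(self, function: str, service: Handler) -> None:
        """Register a new service as the owner of one function."""
        self._services[function] = service

    def handle(self, function: str, request: dict) -> dict:
        service = self._services.get(function)
        if service is not None:
            return service(request)
        # Not migrated yet: the monolith keeps standing and serves the call.
        return self._legacy(function, request)

    def monolith_retired(self, all_functions: Iterable[str]) -> bool:
        """True only once every function has been replaced by a service."""
        return all(f in self._services for f in all_functions)


def checkout_service(request: dict) -> dict:
    """Hypothetical replacement service for a single function."""
    return {"served_by": "checkout-service", "function": "checkout"}


facade = Facade(monolith_handle)
facade.migrate("checkout", checkout_service)

print(facade.handle("checkout", {}))  # handled by the new service
print(facade.handle("search", {}))    # still handled by the monolith
print(facade.monolith_retired(["checkout", "search"]))
```

The point of the design is that callers never need to know which functions have been migrated, so the monolith can be retired one function at a time without a coordinated cut-over.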

More complex control does not work

It is a universal truth that all control processes tend toward increasing complexity. The most obvious manifestation of this tendency is the use of increasingly complex metrics to control the system. Just as the Soviet planners moved from simple output targets to complex targets involving innovation, capital usage and ‘profits’, Amazon tried to evaluate teams via increasingly complex fitness functions: “some of these overly complicated functions combined seven or more metrics, a few of which were composite numbers built from their own submetrics”. However, just as the Soviets eventually did, Amazon realised that more complex control is not better control.

We eventually reverted to relying directly on the underlying metrics instead of the fitness function. After experimenting over many months across many teams, we realized that as long as we did the up-front work to agree on the specific metrics for a team, and we agreed on specific goals for each input metric, that was sufficient to ensure the team would move in the right direction. Combining them into a single, unifying indicator was a very clever idea that simply didn’t work.
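One way to see why a single composite number can mislead is a toy example. This is a sketch under invented assumptions: the metric names, weights and goals below are hypothetical and are not Amazon's. The `fitness` function is the “sum of a weighted series of metrics”, while `missed_goals` checks each underlying metric against its own agreed goal.

```python
from typing import Dict, List


def fitness(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Composite score: the weighted sum of the individual metrics."""
    return sum(weights[name] * value for name, value in metrics.items())


def missed_goals(metrics: Dict[str, float], goals: Dict[str, float]) -> List[str]:
    """Names of the metrics that fall short of their own agreed goal."""
    return sorted(name for name, goal in goals.items() if metrics[name] < goal)


# Hypothetical metrics, each normalised so that higher is better.
weights = {"in_stock_rate": 0.5, "page_speed": 0.25, "conversion": 0.25}
goals = {"in_stock_rate": 0.7, "page_speed": 0.7, "conversion": 0.7}

team_a = {"in_stock_rate": 0.95, "page_speed": 0.40, "conversion": 0.75}
team_b = {"in_stock_rate": 0.75, "page_speed": 0.80, "conversion": 0.75}

# Both teams get essentially the same composite score...
print(fitness(team_a, weights), fitness(team_b, weights))

# ...but only the per-metric view shows that team A is badly missing one goal.
print(missed_goals(team_a, goals))
print(missed_goals(team_b, goals))
```

In this toy case the weighted sum lets a strong metric mask a weak one, whereas agreeing a goal per metric keeps the shortfall visible without any extra machinery.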

Single Threaded Leadership: Separate, Autonomous and One Aim

In the end, Amazon chose a process that it called “single-threaded leadership”, in which

a single person, unencumbered by competing responsibilities, owns a single major initiative and heads up a separable, largely autonomous team to deliver its goals.
Separable means almost as separable organizationally as APIs are for software. Single-threaded means they don’t work on anything else.

In a sense, this creates new small, autonomous systems within the larger system rather than retrofitting or reforming the larger complex system. Sufficient autonomy is critical, i.e.

can the team build and roll out their changes without coupling, coordination, and approvals from other teams? If the answer is no, then one solution is to carve out a small piece of functionality that can be autonomous and repeat.

The team and its leader must also have one and only one clearly defined aim (thus avoiding the need for complex control). If the initiative is one of many that the leader is responsible for, then it will fail. As Dave Limp put it, “the best way to fail at inventing something is by making it somebody’s part-time job”.

Our world and its systems are too complex, but this problem is not insurmountable. The fundamental principle behind all solutions to this problem is to grow, adapt and innovate via new subsystems/services rather than adding to an already dysfunctional and rigid system. However, there are no final solutions or magic bullets here: for example, a service-oriented architecture can and often will reach the same level of crippling dependency hell over time.

In the long run, a resilient macro-system requires that its micro-systems be allowed to fail and collapse. Macro-resilience requires micro-fragility.
