Is Your Team as Resilient as Your Software?

How to analyze our organizations the same way we analyze software.

Rasmus Feldthaus
The Startup

--

Photo by Olga Guryanova on Unsplash

Companies spend lots of resources to keep their software systems highly available and resilient. But what about the teams building them?

Most software developers face the daily challenge of developing complex systems. We need to consider what level of uptime we must provide, what maximum latency we can tolerate, how much redundancy we need, and how gracefully we can afford to let the system degrade when issues occur.

The design choices we make ultimately determine which trade-offs we accept in our systems. We discuss these issues openly and come up with viable solutions.

So why don’t we try the same analysis on the organizations we work in? It may reveal some interesting findings. After all, if Conway’s law holds and the software we produce mirrors the organizations that build it, then maybe it makes sense to go the other way and analyze those organizations using the techniques we apply in software engineering.

Identifying a single point of failure

A single point of failure in a software system is usually a component that, when it fails, takes the rest of the system down with it. In your team, the candidates are your coworkers or subordinates, depending on where you sit on the corporate ladder.

How critical would it be if one of the people on your team did not show up for work tomorrow?

Is Bob the only one who knows how intricate parts of the system work, or the only one authorized to access certain parts of it? In that case, Bob is your single point of failure. If Bob no longer shows up for work, your team will be unable to perform certain tasks.

Now wait a minute, you may say. Bob is a good colleague; he is not going to suddenly drop like a database connection. Should he decide to pursue a career elsewhere or retire, we can arrange a handover before he leaves. That may very well be true, but unfortunately people do have accidents, and people do become critically ill from time to time.

The difference between the two scenarios is what we call scheduled versus unscheduled downtime in software systems. For critical systems we attempt to account for both, so why should we not strive to do the same when organizing our teams?

One of the tools we normally reach for to tackle single points of failure is adding redundancy to the system. We can do the same for our teams. Unique knowledge can be spread through pair programming, frequent knowledge sharing, and a round-robin scheme where each developer in turn supports every part of the software the team owns. Likewise, authorizations can be granted to more than one team member.
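To make the round-robin idea concrete, here is a minimal sketch in Python that picks the person on support duty for a given week by rotating through the team. The roster and the rotation start date are made-up placeholders.

```python
from datetime import date

# Hypothetical team roster and an arbitrary rotation start date (a Monday).
TEAM = ["Alice", "Bob", "Carol", "Dave"]
ROTATION_START = date(2020, 1, 6)

def support_person(today: date) -> str:
    """Return who is on support duty in the week containing `today`.

    The duty rotates weekly, so every team member is exposed to
    every part of the system over time.
    """
    weeks_elapsed = (today - ROTATION_START).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]

print(support_person(date(2020, 3, 18)))  # prints "Carol"
```

The point is not the code itself, but that the rotation is explicit and predictable, so unique knowledge stops piling up with a single person.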

Redundancy usually comes at a price. In software systems it mostly shows up as extra VMs, licensing costs, and sometimes a performance impact. In most organizations, the cost shows up as lost productivity: put simply, you cannot work on your own tasks while pair programming with somebody else. This is an important point. To have a constructive discussion about the desired level of redundancy, you must be honest about the costs associated with it. Telling your team to implement these measures without allocating the time is like ordering from the menu but not paying the bill. As with most other things in life, there is no free lunch.

Graceful degradation

Graceful degradation is the art of prioritization. In computer systems, it is deciding what gets priority when resources become limited, or the system is exposed to exogenous shocks.

So how does this apply to teams? It happens more often than you might think. Layoffs, where from one day to the next a significant portion of your coworkers are gone but your backlog remains the same. Suddenly everyone is working from home because of the Covid-19 pandemic. New legal or third-party requirements with a short deadline emerge. The list goes on.

The tools we normally reach for when designing software systems include not operating at peak capacity, priority queues, and backpressure. We can apply the same principles when organizing teams. Not operating at peak capacity does not mean everyone should only work a fraction of their hours. But hopefully you have allocated time for refactoring code, improving continuous integration, testing, and so on. That is time you can temporarily reallocate to absorb small shocks.

Prioritization is even more important. If half of your team is suddenly gone, there is no chance you can keep every previous promise to deliver. You must decide what will be done, what will have to wait, and what will not be done at all. This is where backpressure comes in handy: system A signals to system B that it will not accept any more work from system B. In your organization, this means informing stakeholders and other teams about what will not be done and what will be postponed. It is not a fun conversation to have, but it gives stakeholders the option to investigate workarounds and leaves them better placed to prioritize their own work.
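To make the analogy concrete, here is a minimal sketch, in Python, of the two mechanisms combined: a bounded priority queue that processes the most important work first and explicitly rejects new work once it is full, instead of silently accepting more than it can handle. The capacity and the task names are arbitrary.

```python
import heapq
from typing import Optional

class BoundedPriorityQueue:
    """A work queue that prioritizes tasks and applies backpressure.

    When the queue is full, submit() refuses the task instead of
    silently accepting more work than can realistically get done.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []  # (priority, task); a lower number means more urgent

    def submit(self, priority: int, task: str) -> bool:
        if len(self._heap) >= self.capacity:
            # Backpressure: tell the caller we are at capacity,
            # so they can re-plan or look for a workaround.
            return False
        heapq.heappush(self._heap, (priority, task))
        return True

    def next_task(self) -> Optional[str]:
        return heapq.heappop(self._heap)[1] if self._heap else None

# Half the team is gone, so the queue is deliberately short.
queue = BoundedPriorityQueue(capacity=2)
print(queue.submit(1, "Fix the billing outage"))      # True
print(queue.submit(3, "Polish the admin dashboard"))  # True
print(queue.submit(2, "Nice-to-have refactoring"))    # False -> renegotiate
print(queue.next_task())                              # "Fix the billing outage"
```

Returning False here is the organizational equivalent of saying "we will not get to this". It is uncomfortable, but far better than timing out silently.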

One strategy that is definitely not recommended is overpromising and underdelivering. It corresponds to getting timeout errors on your requests while having no idea which requests went through and will be processed, and which ones won’t. Finally, as a rule of thumb, if you are not actively handling these issues, you are most likely overpromising and underdelivering.

Horizontal scalability

In software systems, horizontal scalability boils down to scaling the system by adding more machines rather than bigger machines. In a team, this translates into the ability to onboard new members and get them fully operational and productive. Just like computer systems, teams are subject to the Universal Scalability Law: at some point the cost of maintaining coherence and consistency starts to dominate, and adding new members has a negative impact.
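For reference, the Universal Scalability Law models relative throughput as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α captures contention and β the cost of keeping everyone coherent. The following sketch uses made-up coefficients purely for illustration; it is not a measurement of any real team.

```python
def usl_throughput(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: relative capacity with n workers.

    alpha models contention (queueing on shared resources),
    beta models coherence cost (keeping everyone consistent).
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative, made-up coefficients for a team rather than a cluster.
for n in range(1, 13):
    print(f"{n:2d} members -> relative throughput {usl_throughput(n, 0.1, 0.02):.2f}")
```

With these illustrative numbers, throughput peaks at around seven members and then declines, which is the intuition behind splitting teams before they grow too large.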

The tools we normally reach for here are reducing the need for coordination by identifying which tasks can be done autonomously, and changing the topology of the architecture to keep coordination to a minimum. In teams, this generally translates into keeping teams small, splitting them when they become too large, and making sure a competent person is in charge of each one. A classic example of divide and conquer.

This also implies having good documentation, covering both the technical parts and the core domain knowledge of the field. That way, new hires can familiarize themselves with everything with a higher degree of autonomy. It also means having an easily accessible local test environment, ideally pre-deployed to the new team member's machine. And since we tend to favor asynchronous communication over synchronous communication, consider reducing the number of synchronization points in your workflow: the points where team members are blocked, waiting for others.

A side note about the Universal Scalability Law and team productivity. I sometimes hear the argument for self-managing teams, and that it does not make sense to put a person in charge of a small team. One could make that argument, but expect the coherence and consistency penalties to increase quite a bit. A new task comes in. Who does it? Can it wait? What was I doing? As developers, we know multitasking and concurrency are difficult and come with a considerable performance penalty. Leaving it to chance works surprisingly well from time to time, but is quite often disastrous in the long run. This is a point where machines and humans are remarkably similar.

Eventual consistency

Eventual consistency is a big topic in distributed systems. The basic idea is that while the system may not have a consistent state at all times, eventually the system will converge towards a consistent state if no new changes are applied.
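As a toy illustration of the idea, here is a minimal sketch, assuming a simple last-write-wins merge rule, of two replicas that disagree while writes are still arriving but converge once the writes stop and they exchange state. The keys and values are made up.

```python
class Replica:
    """A toy replica that converges using last-write-wins per key."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def merge(self, other: "Replica"):
        # Keep whichever write is newer for each key.
        for key, (ts, value) in other.data.items():
            if key not in self.data or ts > self.data[key][0]:
                self.data[key] = (ts, value)

front_office, back_office = Replica(), Replica()
front_office.write("invoice-42", "sold", timestamp=1)
back_office.write("invoice-42", "payment booked", timestamp=2)

print(front_office.data == back_office.data)  # False: temporarily inconsistent

# No new writes arrive, the replicas sync, and the system converges.
front_office.merge(back_office)
back_office.merge(front_office)
print(front_office.data == back_office.data)  # True: both agree on the newer state
```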

When you think about it, most organizations operate in a constant state of inconsistency. A sale is made before the payment is processed and booked in the back office. But eventually, everything falls into place. The same goes for most teams in the organization. An interesting thing for a development team to investigate is its codebase and continuous integration pipeline. How confident are you in shipping the code at the various stages of your pipeline? How quickly can you verify that something which isn't quite ready will not make it out to end users? How fast can all of the changes, branches, and features people are working on be gathered into a single coherent unit that everyone, including end users, is happy to ship?

By asking these questions, you can get an idea of what you should improve on and what works well.

Final thoughts

Traditional organizational behavior and management techniques all provide valuable input on how to get the most out of your teams. But I hope to have shown that the practices we apply in software engineering can also provide valuable insight into how your team is organized compared to what you strive to achieve, and more importantly, help you identify weak points and ways to improve on them. Another advantage, especially for teams of software engineers, is that you already share a common language for talking about the structure of your team and the trade-offs you make.

Regardless of your background, considering your team as something you have to engineer provides an interesting fresh perspective for your next retrospective.
