How to improve software delivery performance
Many teams want to be more agile in how they operate and deliver products and projects. Yet there are many stories of large organisations that attempted an Agile transformation and, despite spending significant effort and resources on making it successful, still failed.
This article is derived from DORA's State of DevOps research program, which represents six years of research and data from over 31,000 professionals worldwide. It is the longest-running academically rigorous research investigation of its kind, providing an independent view into the practices and capabilities that drive high performance in technology delivery and, ultimately, organisational outcomes. The research uses behavioural science to identify the most effective and efficient ways to develop and deliver software. So let's get started!
A key goal in digital transformation is optimising software delivery performance, and here is how and why.
Software delivery and operational performance
Organisations increasingly rely on their ability to deliver and operate software systems to achieve their goals, and the industry needs a way to measure the effectiveness of their development and delivery practices. DORA's research program has identified five metrics that provide a high-level systems view of software delivery and performance and predict an organisation's ability to achieve its goals. Together, these five metrics are called software delivery and operational (SDO) performance; they are explained in detail in 'Most effective and efficient way to develop and deliver technology.'
In this article we shall focus on how the following capabilities positively impact SDO and productivity in an organisation. You can expect to read about:
- Cloud: how you implement cloud infrastructure matters
- Loosely coupled architecture
- Code maintainability
- Continuous integration
- Continuous delivery
- Disaster recovery management
- Change management
- Culture of psychological safety
- Useful, easy-to-use tools for increasing productivity
- Internal and external knowledge sources
- Technical debt
Improving SDO and organisational performance
Your starting point is to focus on foundations: Basic automation (such as version control and automated testing), monitoring, clear change approval processes, and a healthy culture. Then identify your constraints to plan your path forward. This strategy works for those just beginning transformations as well as those who have been optimising for years. Focus resources on what is currently holding you back, then iterate: Identify constraints and choose the next target.
Have a look at the model below and choose a goal that you want to improve on and identify the capabilities that impact it. For example, if your goal is to improve software delivery performance (common goal), these capabilities are culture of psychological safety (trust), clear change process, continuous delivery, and cloud. Then focus on those that are your biggest constraints.
It could be that the change process is the biggest bottleneck in your organisation, or you sense that employee engagement is at an all-time low. In the next section, we explain these capabilities in more detail.
Cloud: How you implement cloud infrastructure matters
With the evolving nature of business, more and more organisations are choosing multi-cloud and hybrid cloud solutions. This is because these solutions offer flexibility, control, and availability in addition to performance gains. Having said that, what really matters is how teams implement their cloud services, not just that they are using a cloud technology.
Elite performers were 24 times more likely than low performers to have met all essential cloud characteristics. This may explain why teams and executives who claim to have adopted cloud computing technologies also feel frustration at not reaping the promised benefits of speed and stability: many respondents surveyed in DORA's six-year State of DevOps research program who claim to be using cloud computing have not actually adopted the essential patterns that matter. So what patterns in cloud computing are we actually referring to?
- On-demand self-service. Consumers can automatically provision computing resources as needed, without human interaction from the provider.
- Broad network access. Capabilities can be accessed through heterogeneous platforms such as mobile phones, tablets, laptops, and workstations.
- Resource pooling. Provider resources are pooled in a multi-tenant model, with physical and virtual resources dynamically assigned on demand. The customer may specify location at a higher level of abstraction, such as country, state, or datacenter.
- Rapid elasticity. Capabilities can be elastically provisioned and released to rapidly scale outward or inward on demand, appearing to be unlimited and able to be appropriated in any quantity at any time.
- Measured service. Cloud systems automatically control, optimize, and report resource use based on the type of service such as storage, processing, bandwidth, and active user accounts.
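Rapid elasticity, for example, means scaling capacity automatically in response to load rather than by manual request. A minimal sketch of that idea is below; the thresholds and the `desired_instances` function are illustrative assumptions, not any real cloud provider's API.

```python
import math

def desired_instances(current: int, cpu_utilisation: float,
                      target: float = 0.6, min_n: int = 2, max_n: int = 20) -> int:
    """Return the instance count that brings utilisation back towards target."""
    if cpu_utilisation <= 0:
        return min_n
    # Proportional scaling rule: new = ceil(current * observed / target)
    wanted = math.ceil(current * cpu_utilisation / target)
    return max(min_n, min(max_n, wanted))

print(desired_instances(4, 0.9))   # high load -> scale out to 6
print(desired_instances(10, 0.2))  # low load  -> scale in to 4
```

A real autoscaler would also smooth the metric over a window and apply cooldowns to avoid flapping, but the decision logic has this shape.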
These characteristics matter when defining what it means to adopt cloud computing because they enable an actionable strategy for success and improve SDO performance. By focusing on execution in the cloud any team or organization is capable of reaping the benefits of speed, stability, and availability.
The solution to executing DevOps for maximum effect is to take a holistic approach: first understand the constraints of your current software delivery process with an eye to your short-term and long-term outcomes. The next important step is to empower your teams to decide how best to accomplish those outcomes; after all, they are the experts in their work and context. Those who adopt this approach see more scalable and flexible solutions, and because management no longer has to micromanage detailed execution plans, it can focus on high-level outcomes, allowing the organisation to grow. By designing and executing short-term outcomes that support the long-term strategy, teams can adjust to emergent and unanticipated problems, outperforming peers whose three- and five-year plans are not flexible or nimble enough to keep up with changes in customer demand, the technology landscape, or emergent security threats.
Concurrent efforts at team and organisation levels
Some capabilities are typically developed at the team level, while others—particularly in large organisations or organisations with strong hierarchical structures—often require organisation-level efforts. These two streams—team-level and organisation-level—can and should proceed concurrently, as they often support each other.
For example, creating a continuous integration platform that makes it easy for teams to get fast feedback on their automated tests can be a significant force-multiplier when used across several teams in an organisation. Similarly, deployment automation at the team level will have little impact if the team’s code can only be deployed together with that of other teams. This points to an architectural obstacle that must be resolved at the organisational level (which, in turn, is likely to require work from individual teams).
Remember that our goal is improving our ability to deliver software, which we accomplish through technical practices in delivery and deployment we call continuous delivery (CD). CD reduces the risk and cost of performing releases. Continuous delivery for the sake of continuous delivery is not enough if you want your organisation to succeed, however. It must be done with an eye to organisational goals such as profitability, productivity, and customer satisfaction.
Team-level technical capabilities
Test automation has a significant impact on continuous integration (CI) and, in turn, on continuous delivery (CD). With automated testing, developers gain confidence that a failure in a test suite denotes an actual failure, just as a test suite passing means the software can be successfully deployed. The ability to reproduce and fix failures, gather feedback from tests, improve test quality, and iterate test runs quickly also ties into automated testing.
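A minimal sketch of the kind of automated test CI runs on every commit is below. The shopping-cart function and its checks are hypothetical; the point is that a failing assertion reliably signals a real defect before deployment.

```python
def apply_discount(total: float, percent: float) -> float:
    """Apply a percentage discount to an order total."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)

# In a real project these checks would live in a test file and run under a
# framework such as pytest; bare asserts keep the sketch self-contained.
assert apply_discount(100.0, 20) == 80.0
assert apply_discount(59.99, 0) == 59.99
try:
    apply_discount(10.0, 150)
except ValueError:
    pass
else:
    raise AssertionError("expected a ValueError for an invalid discount")
print("all checks passed")
```

When such a suite runs automatically on every change, a red build is trustworthy evidence of a regression rather than a flaky environment.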
Deployment automation, trunk-based development, and monitoring impact CD. Effective trunk-based development is characterised by fewer than three active branches, with branches and forks having lifetimes of less than a day before being merged to master. These capabilities may have dependencies on organisation-level work, as described for deployment automation. For example, teams can monitor their own code, but will not see the full benefits unless both application and infrastructure are monitored and used to make decisions.
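The trunk-based rules above are easy to check mechanically. Below is a hypothetical helper that flags branches alive longer than a day; in practice the branch data would come from your version control system's API, but here it is hard-coded for illustration.

```python
from datetime import datetime, timedelta

def trunk_based_violations(branches: dict, now: datetime) -> list:
    """Return names of branches alive longer than 24 hours."""
    return [name for name, created in branches.items()
            if now - created > timedelta(hours=24)]

now = datetime(2020, 1, 2, 12, 0)
branches = {
    "feature/login": datetime(2020, 1, 2, 9, 0),     # 3 hours old: fine
    "feature/search": datetime(2019, 12, 28, 9, 0),  # 5 days old: flag it
}
print(trunk_based_violations(branches, now))  # ['feature/search']
if len(branches) >= 3:
    print("warning: more than two active branches")
```

A check like this could run nightly and nudge teams whose branches are drifting away from trunk.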
Organisation-level technical capabilities
In contrast to capabilities that can be implemented and executed at the team level for quick impact, some capabilities benefit from organisation-level coordination and sponsorship. Examples of these kinds of capabilities are those that involve decisions or design that span several teams, such as architecture or policy (e.g., change management).
Loosely coupled architecture has a positive impact on CD. A loosely coupled architecture is one in which delivery teams can independently test, deploy, and change their systems on demand, without depending on other teams for additional support, services, resources, or approvals, and with less back-and-forth communication. This allows teams to deliver value quickly, but it requires orchestration at a higher level.
Code maintainability positively contributes to successful CD. Teams that manage code maintainability well have systems and tools that make it easy for developers to change code maintained by other teams, find examples in the codebase, reuse other people’s code, as well as add, upgrade, and migrate to new versions of dependencies without breaking their code. Having these systems and tools in place not only contributes to CD, but also helps decrease technical debt, which in turn improves productivity.
Organisations that elevate code maintainability provide real advantages to their engineers. For example, managing dependencies is hard. Updating a dependency could open a rabbit hole to issues such as breaking API changes, updating a transitive dependency, creating incompatible dependencies (for example, the diamond dependency issue), and breaking functionality. Tooling that can help avoid these errors or illuminate the consequences of code changes can improve design decisions and code quality for all engineers.
Useful, easy-to-use tools for deploying software
Advanced users such as developers, testers, and sysadmins were previously neglected when considering the usability of their tooling. Sometimes management assumed that, as relative technology experts, the technologists could figure out any tool they were given. This isn't an uncommon mindset. In World War II, pilots were selected and trained based on their ability to operate overly complex cockpits. Then usability experts realized that complex work like piloting an aircraft was difficult enough; it was better to design a cockpit to be easy to use and understandable, and let pilots spend their attention safely piloting the aircraft.
Other times, usability needs are ignored because management assumes that technologists' needs are like those of regular end users. Today, we know that power users (such as engineers) often have special use cases, with unique design needs. Technologists also include broader skill sets and backgrounds, such as UX, infosec, and database engineers, as well as diverse abilities. Making tools that are accessible and easy to use is an important consideration for tool vendors.
With this in mind, in this year's research we studied the usability of the tools used to deploy software, because technical practices that support software development and deployment are important to speed and stability. The usefulness and ease of use of this deployment tooling is highly correlated with CI and CD. This makes sense: the better our tools are suited to our work, the better we are able to do it.
Disaster recovery testing
Every organisation that runs mission-critical software systems should have a disaster recovery plan and test it thoroughly. These tests need to be performed using production systems. The types of disaster recovery tests that can be run are:
- Table-top exercises that are not carried out on real systems
- Infrastructure (including datacenter) failover
- Application failover
- Simulations that disrupt production-like test systems (including failure injection such as degrading network links, turning off routers etc.)
- Simulations that disrupt production systems (including failure injection such as degrading network links, turning off routers etc.)
- Creating automation and systems that disrupt production systems on a regular, ongoing basis
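Failure injection in miniature looks something like the toy sketch below: disrupt a component on purpose and verify that the recovery path actually works. All the names here (`Service`, `fetch_with_failover`, primary/replica) are illustrative, not a real chaos-engineering tool.

```python
class Service:
    """Toy stand-in for a network service that can be 'disrupted'."""
    def __init__(self, name: str, healthy: bool = True):
        self.name, self.healthy = name, healthy

    def fetch(self) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"data from {self.name}"

def fetch_with_failover(primary: Service, replica: Service) -> str:
    """The recovery path under test: fall back to the replica on failure."""
    try:
        return primary.fetch()
    except ConnectionError:
        return replica.fetch()

primary, replica = Service("primary"), Service("replica")
primary.healthy = False  # the injected failure
print(fetch_with_failover(primary, replica))  # data from replica
```

Real tests at the later stages of the list above inject failures into production infrastructure (degraded links, killed instances) and confirm that failover, alerting, and recovery automation behave as designed.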
Organisations that perform disaster recovery testing at least annually are more likely to have higher levels of service availability—that is, the ability for technology teams and organisations to make and keep promises and assertions about the software product or service they are operating.
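Whether a team is keeping its availability promises comes down to simple arithmetic over uptime. The 99.9% target and outage figures below are illustrative assumptions, not research data.

```python
def availability(total_minutes: int, downtime_minutes: int) -> float:
    """Availability as a percentage of the measurement window."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes

minutes_in_30_days = 30 * 24 * 60  # 43200
achieved = availability(minutes_in_30_days, 50)  # 50 minutes of outage
print(f"{achieved:.3f}%")
print("SLO met" if achieved >= 99.9 else "SLO missed")
```

At a 99.9% target, a 30-day window allows roughly 43 minutes of downtime, so 50 minutes of outage misses the promise; regular disaster recovery testing is one way teams keep real outages inside that budget.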
Change management
Making changes to software and production systems is often complex and bureaucratic. Two factors are responsible for much of this complexity: the need for coordination between teams, and the requirements of regulatory control, particularly in financial services, healthcare, and government. While the complexities involved in implementing regulatory control requirements are beyond the influence of leadership and practitioners, we can influence the role that team coordination plays in change management, and that role is changing.
For example, segregation of duties, which states that changes must be approved by someone other than the author, is often required by regulatory frameworks. While we agree that no individual should have end-to-end control over a process (the intent of this control), there are lightweight, secure ways to achieve this objective that don’t suffer the same coordination costs as heavyweight approaches.
Heavyweight change process
Formal change management processes that require the approval of an external body, such as a change advisory board (CAB) or a senior manager, for significant changes have a negative impact on software delivery performance. Organisations were 2.6 times more likely to be low performers if they had this kind of approval in place. The rationale for a heavyweight change process is often that it reduces the risk of releases and is negatively correlated with failure rates, i.e., errors found in production. However, DORA (DevOps Research & Assessment), drawing on over six years of research and responses from over 6,000 organisations, found no evidence that a formal approval process was associated with lower change fail rates. The introduction of a more formal approval process does, however, slow the process down and results in larger batches being released less frequently. As a consequence, each release to production carries higher levels of risk, and thus a higher change fail rate. This has important implications for organisations working to reduce the risk in their release process, as organisations tend to respond to problems with software releases by introducing additional process and more heavyweight approvals. Analysis suggests that this approach makes things worse.
What can organisations do to reduce the risks and the impact of releases?
We recommend that organisations move away from external change approval because of the negative effects on performance. Instead, organisations should “shift left” to peer review-based approval during the development process. In addition to peer review, automation can be leveraged to detect, prevent, and correct bad changes much earlier in the delivery lifecycle. Techniques such as continuous testing, continuous integration, and comprehensive monitoring and observability provide early and automated detection, visibility, and fast feedback. In this way, errors can be corrected sooner than would be possible if waiting for a formal review.
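Shifting left means every proposed change passes automated gates before merge, so bad changes are caught early instead of by an external board. The sketch below is a hypothetical pre-merge gate; the check fields stand in for real test suites, linters, and review tooling.

```python
def run_checks(change: dict) -> list:
    """Return the list of failed checks for a proposed change."""
    failures = []
    if not change.get("tests_pass"):
        failures.append("unit tests")
    if change.get("lint_errors", 0) > 0:
        failures.append("lint")
    if not change.get("peer_reviewed"):
        failures.append("peer review")
    return failures

good = {"tests_pass": True, "lint_errors": 0, "peer_reviewed": True}
bad = {"tests_pass": True, "lint_errors": 3, "peer_reviewed": False}
print(run_checks(good))  # [] -> safe to merge
print(run_checks(bad))   # ['lint', 'peer review'] -> blocked automatically
```

Because the gate runs on every change, feedback arrives in minutes rather than at the next board meeting, which is exactly the early detection the paragraph above describes.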
Clear change process
While moving away from traditional, formal change management processes is the ultimate goal, simply doing a better job of communicating the existing process and helping teams navigate it efficiently has a positive impact on software delivery performance. When team members have a clear understanding of the process to get changes approved for implementation, this drives high performance. This means they are confident they can get changes through the approval process in a timely manner and know the steps it takes to go from “submitted” to “accepted” every time for all the types of changes they typically make.
Change management is consistently one of the biggest constraints. Removing this constraint requires work at multiple levels. Leaders at every level should move away from a formal approval process where external boards act as gatekeepers approving changes, and instead move to a governance and capability development role. Internal teams can implement continuous integration, continuous testing, and peer review to find bad changes as quickly as possible, while also satisfying segregation of duties, with technical practitioners building and automating change management solutions.
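Segregation of duties can be enforced in tooling rather than by a review board: the platform simply refuses to let authors approve their own changes. The data model below is a hypothetical sketch, not a real platform's API.

```python
def approve(change: dict, approver: str) -> dict:
    """Record an approval, rejecting self-approval by the author."""
    if approver == change["author"]:
        raise PermissionError("author cannot approve their own change")
    return {**change, "approved_by": approver, "status": "approved"}

change = {"id": 42, "author": "alice", "status": "open"}
print(approve(change, "bob")["status"])  # approved
try:
    approve(change, "alice")
except PermissionError as err:
    print(err)  # author cannot approve their own change
```

A rule like this satisfies the intent of the control (no individual has end-to-end control) at a fraction of the coordination cost of a CAB meeting.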
Culture of psychological safety
Organisational culture that optimises for information flow, trust, innovation, and risk-sharing is predictive of SDO performance. Indeed, Davis and Daniels cite culture as a key factor in successful and scalable technology efforts in their book Effective DevOps: building a culture of collaboration, affinity, and tooling at scale.
Research from a large two-year study at Google, based on an organisational culture framework proposed by a sociologist, found similar results: high-performing teams need a culture of trust and psychological safety, meaningful work, and clarity. This team environment allows members to take calculated and moderate risks, speak up, and be more creative.
The researchers found that what really mattered was less about who is on the team and more about how the team worked together. In order of importance:
- Psychological safety: Psychological safety refers to an individual’s perception of the consequences of taking an interpersonal risk or a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive. In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.
- Dependability: On dependable teams, members reliably complete quality work on time (vs the opposite – shirking responsibilities).
- Structure and clarity: An individual’s understanding of job expectations, the process for fulfilling these expectations, and the consequences of one’s performance are important for team effectiveness. Goals can be set at the individual or group level, and must be specific, challenging, and attainable. Google often uses Objectives and Key Results (OKRs) to help set and communicate short and long term goals.
- Meaning: Finding a sense of purpose in either the work itself or the output is important for team effectiveness. The meaning of work is personal and can vary: financial security, supporting family, helping the team succeed, or self-expression for each individual, for example.
- Impact: The results of one’s work, the subjective judgement that your work is making a difference, is important for teams. Seeing that one’s work is contributing to the organisation’s goals can help reveal impact.
Read about the researchers behind the work in The New York Times: What Google Learned From Its Quest to Build the Perfect Team
How do we improve productivity?
Another important goal in teams and organisations is improving productivity to get more value out of your transformation and your employees. This involves how organisations can support it with smart investments in tools and information, how technical debt interrupts it, and how it affects employee work/life balance and burnout.
Most agree that productivity is important: productive engineers are able to do their work more efficiently, giving them more time to re-invest into other work, such as documentation, refactoring, or doing more of their core function to deliver additional features or build out additional infrastructure.
But what is productivity, and how should we measure it? Productivity cannot be captured with a simple metric such as lines of code, story points, or bugs closed; doing so results in unintended consequences that sacrifice the overall goals of the team. For example, teams may refuse to help others because it would negatively impact their velocity, even if their help is important to achieve organisational goals.
Researchers have discussed this topic at length, and most have come to the same conclusion:
Productivity is the ability to get complex, time-consuming tasks completed with minimal distractions and interruptions.
Many of us describe this as getting into a good work flow or rhythm. To use this model, locate the goal you want to improve in the figure, and then identify the capabilities that impact it. For example, if your goal is to reduce technical debt, these capabilities are code maintainability, having a loosely coupled architecture, and monitoring.
Useful, easy-to-use tools for increasing productivity
Useful and easy-to-use tools are now considered a must-have for consumer technologies, but these obvious characteristics are often overlooked among technology professionals who assume they are experts and can make any tool or technology work. (Or because those purchasing tools for these groups assume usability is less important for technologists, or are optimizing for other factors such as cost, licensing terms, or vendor management.) In fact, the opposite is true: When building complex systems and managing business-critical infrastructure, tools are even more important because the work is more difficult. We focused on tools used in deploying software through the CI/CD and test automation toolchain because they are at the heart of DevOps. We found that these two attributes drive productivity:
- How easy it is to use the toolchain (including straightforward and easy interactions and operation)
- How useful the toolchain is in accomplishing job-related goals
Tools usage by performance profile
- The strongest contrast between low and elite performers is the use of fully proprietary software developed in-house. Proprietary software may be valuable, but it comes at great cost to maintain and support. It's no surprise that the highest performers have moved away from this model.
- There is a relatively equal concentration of commercial off-the-shelf (COTS) software, with little customisation, across performance profiles. Some may wonder how high performers can use COTS and still be high performers.
As Martin Fowler outlines in his article Utility vs. Strategic Dichotomy, companies should be thoughtful about which software is strategic and which is merely utility. By addressing their utility needs with COTS solutions and minimising customisation, high performers save their resources for strategic software development efforts.
Elite performers automate and integrate tools more frequently into their toolchains on almost all dimensions. Although automation may be seen as too expensive to implement (we often hear, "I don't have time or budget to automate; it's not a feature!"), automation is truly a sound investment. It allows engineers to spend less time on manual work, thereby freeing up time to spend on other important activities such as new development, refactoring, design work, and documentation. It also gives engineers more confidence in the toolchain, reducing stress in pushing changes.
Internal and external knowledge sources
Finding the right information to help solve a problem, debug an error, or find a similar solution quickly and easily can be a key factor in getting work done and maintaining the flow of work. We found that having access to information sources supports productivity. These information sources come in two categories: internal and external knowledge sources.
Internal knowledge sources: Investments that support document and code creation as well as effective search for company knowledge bases, code repositories, ticketing systems, and other docs contribute to engineering productivity. Providing developers, sysadmins, and support staff with the ability to search internal resources allows them to find answers that are uniquely suited to the work context (for example, using “find similar” functions) and apply solutions faster. In addition, internal knowledge bases that are adequately supported and fostered create opportunities for additional information sharing and knowledge capture. If the organisation has invested in systems that can easily search across all types of information and data, the culture can contribute to a “virtuous cycle” of knowledge sharing.
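A "find similar" lookup over an internal knowledge base can be sketched with only the standard library. A real system would index documents, tickets, and code; `difflib` just gives the flavour of fuzzy matching on titles, and the knowledge-base entries here are invented.

```python
import difflib

knowledge_base = [
    "How to rotate database credentials",
    "Deploying the payments service",
    "Debugging TLS handshake failures",
    "On-call runbook for the search cluster",
]

def find_similar(query: str, n: int = 2) -> list:
    """Return up to n knowledge-base titles similar to the query, best first."""
    return difflib.get_close_matches(query, knowledge_base, n=n, cutoff=0.3)

print(find_similar("debug TLS handshake failure"))
```

Even this crude matcher surfaces the right runbook for an imprecise query, which is exactly the "apply solutions faster" effect described above.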
External knowledge sources: These include external sources such as search engines and Stack Overflow. External search is important because these technologies provide strong communities for learning and growing and provide support for the use and adoption of public cloud and open source tooling. That is, leveraging commonly used external tools and systems with a strong user community and good ecosystem allows tech professionals to troubleshoot with the world, while proprietary and home-grown implementations only allow experts within an organisation to weigh in on possible solutions. Elite performers use more open source tooling and low performers have the highest use of proprietary data; these technology choices are bound to have an impact on productivity.
Technical debt
Technical debt was introduced in 1992 by Ward Cunningham to describe what happens when we fail to adequately maintain what he calls "immature" code.
Although immature code may work fine and be completely acceptable to the customer, excess quantities will make a program unmasterable, leading to extreme specialisation of programmers and finally an inflexible product. Shipping first-time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite… The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organisations can be brought to a standstill under the debt load of an unconsolidated implementation.
In today’s complex systems, technical debt can occur in scripts, configuration files, and infrastructure as well as application code. Technical debt includes code or systems with:
- Known bugs that go unfixed in favour of new features
- Insufficient test coverage
- Problems related to low code quality or poor design
- Code or artefacts that aren’t cleaned up when no longer used
- Implementations that the current team doesn’t have expertise in, and therefore can’t effectively debug or maintain
- Incomplete migration
- Obsolete technology
- Incomplete or outdated documentation or missing comments
Technical debt negatively impacts productivity.
Actively reducing technical debt
How can we actually reduce technical debt, and not just cope with it? One approach is refactoring. Refactoring is a "disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior," and Martin Fowler points out that refactoring should be part of daily work. Better tooling with robust refactoring support built in is also important. In fact, many large organisations have invested in tools for refactoring across their codebase; for example, Facebook open-sourced its tool fastmod and Google has open-sourced ClangMR.
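Refactoring in miniature looks like the sketch below: restructure code without changing behaviour, with tests as the safety net. Both versions and the order data are invented for illustration.

```python
def total_price_before(items):
    # Tangled original: mixes iteration, discounting, and rounding.
    t = 0
    for i in items:
        p = i["price"] * i["qty"]
        if i["qty"] >= 10:
            p = p * 0.9
        t = t + p
    return round(t, 2)

def line_total(item: dict) -> float:
    """Extracted helper: one item's cost, with a 10% bulk discount."""
    subtotal = item["price"] * item["qty"]
    return subtotal * 0.9 if item["qty"] >= 10 else subtotal

def total_price_after(items: list) -> float:
    """Refactored version: same behaviour, clearer structure."""
    return round(sum(line_total(i) for i in items), 2)

order = [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 12}]
# The refactor is safe only if behaviour is identical:
assert total_price_before(order) == total_price_after(order)
print(total_price_after(order))
```

The assertion is the essential step: without a behavioural check, a "refactor" is just an untested rewrite.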
Culture of psychological safety
A culture that values psychological safety, trust, and respect contributes to productivity by letting employees focus on solving problems and getting their work done rather than politics and fighting. This echoes work by other researchers; as we discussed in an earlier section, a study by Google found that this same kind of culture leads to more effective teams.
Additional benefits of improved productivity
The benefits to the team and organisation from higher productivity are usually obvious: more work gets done, so we deliver more value. But what about benefits to the people doing the work?
Research shows that productivity has a positive impact on work recovery. Work recovery is the ability to cope with work stress and detach from work when we are not working. Research shows that people who can detach from work have better well-being and handle work-related stress better. The reverse of this is also important: Feeling overworked leads to difficulty detaching, which leads to burnout and lower life satisfaction.
Burnout has been recognised by the World Health Organisation as a condition that results from unmanaged chronic workplace stress, and it is more than just being tired. Burnout is a combination of exhaustion, cynicism, and inefficacy at work. Research found that work recovery can reduce burnout, and that good technical practices and improved process (in the form of clear change management) can reduce burnout. Low performers are twice as likely to report feeling burned out as the highest performers.
How do you transform: what really works
So how do organisations spread new ways of working? Below we list common approaches used most often by high and elite performers for spreading best practices throughout an organisation.
- Encourage Communities of Practice (CoPs). CoPs are organised groups of people who share a common interest in a specific technical or business domain. They collaborate regularly to share information, improve their skills, and actively work on advancing the general knowledge of the domain.
- Initiate a grassroots/bottom-up movement. Grassroots agile is a common technique: one of the fundamental principles of Scrum is to do what is possible, so start doing Scrum on a team without initial top-down buy-in. What makes Scrum effective is transparency, so be sure to communicate up the chain why you are doing this and the value you are seeing. The most effective organisational change is both bottom-up and top-down.
- Run proofs of concept (POCs). The purpose of a POC is to validate that a product or an idea is feasible and that investing more money and time into it will be commercially viable. In other words, a proof of concept is an experiment run to test out an idea quickly.
With these practices in mind, we can start to organise a transformation.