Made Tech Blog

Why lots of software projects get messy

Software engineers have a range of tools at their disposal to organise their software projects. At the most basic, we can use file names and directory structures. There are also more advanced tools that we can use, such as modules, functions, and objects. These tools can be combined to achieve the same user-facing outcome with different cost-of-change attributes.

It’s possible for one architectural design to allow flexibility in ways that help deliver the backlog, but for another design to double or even quadruple the effort required to deliver the same backlog.

There are many dimensions to the cost of change of a software project, and good software engineers consider them continuously. Simplified, it can be thought of in 2 ways: the baseline cost of change and the cost of change in a particular direction. The baseline cost of change is the engineering effort required to implement any feature (of any shape or size). In contrast, the cost of change in a particular direction reflects how well the design is optimised to let the team quickly implement changes of a particular nature.

Good teams keep their baseline cost-of-change low and learn from their historic backlog to make it easier to deliver more likely changes. This continuous effort helps keep the average pace (measured over a number of iterations) of delivery predictable.

Most teams aren’t aware of how to reduce their baseline cost of change, much less how to optimise their design for their circumstances. The most common warning sign is when your software engineers fear changing the code.

If you don’t have automated tests, start now

If your engineers fear making changes, you know 2 things: they aren’t optimising the architectural design, and the baseline cost of change is high. This fear of change comes from not being confident that the valuable “behaviour” (the software’s functionality) hasn’t changed or had defects introduced. Teams lose confidence in their ability to change when there isn’t a quick and easy way to verify the behaviour of the software.

The first way teams can drastically improve their baseline cost of change is to practise double-loop test-driven development (TDD). A common objection is that TDD isn’t possible on a particular project, but this is almost never the case. TDD is possible in legacy code environments but requires slightly more care. In any case, to move beyond mediocrity, teams must adopt a disciplined approach to software engineering: TDD.

Useful books on this topic:

In software projects without tests, the most valuable thing that teams can begin to do is write tests for all new code and around all code that is changed. Michael Feathers’ Working Effectively with Legacy Code explains a number of techniques to achieve this. In addition, it’s worthwhile exploring mutation testing and approval testing to “characterise” the behaviour of existing code (these systematic approaches to test-last are often called “characterisation testing”).
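As a rough illustration of characterisation testing, the sketch below pins down what a piece of untested code does today, before anyone changes it. The `legacy_shipping_cost` function and its rules are hypothetical stand-ins for real legacy code. The expected values were captured by running the code, not taken from a specification.

```python
import unittest


def legacy_shipping_cost(weight_kg, express):
    # Hypothetical legacy function: nobody remembers why the rules look like this.
    cost = 5 if weight_kg <= 2 else 5 + (weight_kg - 2) * 2
    if express:
        cost = cost * 1.5
    return round(cost, 2)


class CharacteriseShippingCost(unittest.TestCase):
    # These tests describe what the code does now, not what it should do.
    # If one fails after a refactor, behaviour has changed.
    def test_light_parcel(self):
        self.assertEqual(legacy_shipping_cost(1, express=False), 5)

    def test_heavy_parcel(self):
        self.assertEqual(legacy_shipping_cost(4, express=False), 9)

    def test_heavy_express_parcel(self):
        self.assertEqual(legacy_shipping_cost(4, express=True), 13.5)
```

Saved in a test module, this can be run with `python -m unittest`. Once the current behaviour is pinned down like this, the team can restructure the function and see immediately if anything has changed.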

Once a team starts progressing in getting their software under test, they will notice their confidence will begin to grow, and their change failure rate (failure demand) should decrease. You can expect this effort to take at least 2 to 3 quarters to begin to have a significant impact, but you should expect to see a few things within the first quarter. In particular, you should expect the team to begin finding pre-existing defects, eliminating unused code, and experiencing fewer surprises in production.

Thinking beyond homogeneity

One reason that teams struggle to optimise for specific backlogs is that their software design is too homogeneous. It looks too much like other software projects that do other things! In web applications, this is often caused by teams following a cookie-cutter pattern which closely aligns with the tools they are using. These teams will often have 3 or 4 layers of code, which their code must fit into. A third-party framework often defines these layers and may use names like “Component”, “Reducer”, “Model”, “Entity”, “Controller”, “View”, “Route”, and “Form.” If your software design looks like every other software design, then you haven’t optimised your code for your backlog.

For lots of large projects, the pattern that causes many negative side-effects is the third-party framework-provided Active Record pattern. It’s seen in many MVC frameworks, from .NET’s Entity Framework to the classic Ruby on Rails ActiveRecord. As used in most software projects, it leads to a code smell known as fragility. Fragility is where a change to a feature causes an unexpected defect in a seemingly unrelated feature. The Active Record pattern encourages coupling across many different use cases of a system and often assumes that the validation rules exist at the database layer. These assumptions are sometimes true, in particular in smaller systems, but very quickly (certainly in less than a few quarters) stop being true as systems grow in size. In addition, coupling business rules to database operations makes the database a bottleneck that heavily constrains the execution time of automated test suites.

Ideally, software should be designed around the specifics of your application’s domain and the required layers should be chosen per application. This will allow the software design to be optimised to enable easier backlog delivery. The application will still have the layers defined by the third-party framework, but they are only used as a mechanism to deliver the business logic of the application. For example, in a server-side web application written in Python, the third-party framework layers would be responsible for HTTP, HTML and handling things like form submission. It would not be responsible for validating input, applying business rules, making external calls to APIs or interacting with persistence such as a database.
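To make that separation concrete, here is a minimal sketch with no real framework involved. All of the names (`RegisterApplicant`, `InMemoryApplicantGateway`, `handle_form_submission`) are hypothetical. In a real application the handler would be a Flask or Django view and the gateway would wrap a database.

```python
from dataclasses import dataclass


@dataclass
class RegistrationResult:
    successful: bool
    errors: list


class RegisterApplicant:
    """Business logic: knows nothing about HTTP, HTML or SQL."""

    def __init__(self, applicant_gateway):
        self.applicant_gateway = applicant_gateway

    def execute(self, name, email):
        errors = []
        if not name.strip():
            errors.append("name_missing")
        if "@" not in email:
            errors.append("email_invalid")
        if errors:
            return RegistrationResult(successful=False, errors=errors)
        self.applicant_gateway.save(name=name.strip(), email=email)
        return RegistrationResult(successful=True, errors=[])


class InMemoryApplicantGateway:
    """Stand-in for persistence; a real gateway might wrap a database."""

    def __init__(self):
        self.saved = []

    def save(self, name, email):
        self.saved.append({"name": name, "email": email})


def handle_form_submission(form, use_case):
    """Delivery layer: turn a submitted form into a use case call and a status code."""
    result = use_case.execute(form.get("name", ""), form.get("email", ""))
    if result.successful:
        return 303, "/thank-you"  # redirect target is an illustrative assumption
    return 422, ", ".join(result.errors)
```

The handler only translates between the web and the use case. The validation and the decision to save live in one place, which the framework never sees.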

Domain-driven design and hexagonal architectures achieve heterogeneity between applications intended to solve different problems. You should expect teams using these approaches to find themselves mostly adding code rather than modifying code to add new features. In addition, it becomes evident from the software design what the software is doing and how it relates to the users’ domain. This is in stark contrast to designs where the reader must map the users’ domain onto the relevant third-party framework concepts.

The code design of systems designed using techniques such as these often cleanly separates business logic from other aspects such as the method of delivery (the web), other external systems (APIs) and persistence (database). This means that business logic can be tested in isolation from these slow bottlenecks, which provides extremely fast tests. These fast tests aid in changing the code structure quickly so that it can be optimised to deliver the backlog. Teams with fast tests (e.g. that run in under 10 seconds) can make more code structure changes with confidence in an hour than teams with slow tests (multiple minutes). Teams with glacially slow tests (fifteen minutes or even more than an hour) will tend not to change the structure of their code as they can’t justify the per-change return on investment to themselves.
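As a sketch of what testing in isolation can look like, the business rule below depends on an external score lookup, which the tests replace with an in-memory stub. `CheckEligibility`, `StubScoreProvider` and the threshold of 600 are hypothetical, chosen only for illustration.

```python
import unittest


class CheckEligibility:
    """Hypothetical business rule: applicants need a score of at least 600."""

    def __init__(self, score_provider):
        self.score_provider = score_provider

    def execute(self, applicant_id):
        return self.score_provider.score_for(applicant_id) >= 600


class StubScoreProvider:
    """Replaces the slow external API with a canned answer held in memory."""

    def __init__(self, score):
        self.score = score

    def score_for(self, applicant_id):
        return self.score


class CheckEligibilityTest(unittest.TestCase):
    def test_accepts_score_on_the_threshold(self):
        use_case = CheckEligibility(StubScoreProvider(600))
        self.assertTrue(use_case.execute("applicant-1"))

    def test_rejects_score_below_the_threshold(self):
        use_case = CheckEligibility(StubScoreProvider(599))
        self.assertFalse(use_case.execute("applicant-1"))
```

Nothing here touches the network or a database, so tests of this shape stay fast however many of them a team accumulates.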

Assuming equal capability in the chosen approach, teams which practise double-loop TDD but essentially outsource a lot of their software design to third-party frameworks will more often than not lose the race compared to teams that design their software to meet the specifics of the domain.

In the public sector, organisations often define rigid technology strategies which heavily constrain how software should be designed. It’s not uncommon for all teams to be told that they must build things the “Ruby on Rails” way, the “Django” way or the “.NET Core MVC” way. These organisations will then hire individuals into technical architect roles who ensure that teams follow the third-party framework’s lowest-common-denominator approach. These organisations are enforcing homogeneity of their software design. While this is undoubtedly helpful to teams without the capability to optimise their software designs and adopt a heterogeneous approach, many teams are capable but suffer the consequences of this strategy. Adopting this lowest-common-denominator approach ensures all your teams will only ever achieve a maximum of consistently average performance.

Work towards cohesive systems

Lastly, we are left to explore the problem of cohesion. I have seen many teams battle with a systemic lack of cohesion, and those teams are often oblivious to the cause or believe that it “has to be that way”. A lot of this has to do with the trend of the last 7 years towards single-page applications backed by “RESTful” JSON APIs. Teams have 2 code bases that they must change in lockstep to deliver features. To work around this problem, we’ve seen the rise of technologies such as GraphQL, which paper over the deficiencies of this approach.

What is the problem? There is an industry-endemic misunderstanding of what RESTful means. A defining feature of a RESTful API is Hypermedia as the Engine of Application State (HATEOAS). However, most, if not all, “RESTful APIs” are, in fact, simply Remote Procedure Calls over HTTP dressed up in a sort of REST-like fashion. A common objection to pointing this out is, “So what? This is what modern REST APIs look like, HATEOAS was a pipedream anyway.” Quite frankly, this objection misses the point.
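To show the difference in miniature, the sketch below builds a response in which the server tells the client which transitions are currently available, so the client doesn’t hard-code that knowledge. The `_links` shape, the URLs and the statuses are illustrative assumptions, not any particular standard.

```python
def application_representation(application):
    """Return the resource's data plus the transitions allowed in its current state."""
    base = f"/applications/{application['id']}"
    links = {"self": {"href": base, "method": "GET"}}

    # The server, not the client, decides which actions are available.
    if application["status"] == "draft":
        links["submit"] = {"href": f"{base}/submission", "method": "POST"}
    if application["status"] == "submitted":
        links["withdraw"] = {"href": f"{base}/withdrawal", "method": "POST"}

    return {"status": application["status"], "_links": links}
```

A client that follows the links it is given needs no change when the rules about who can submit or withdraw move on. An RPC-style client that builds URLs itself has to be redeployed in lockstep with the server.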

Many software engineers would rarely think of the interface between a web browser such as Firefox and a backend web application that responds with HTML as an API, but that is exactly what it is! The web browser is a technology that supports a range of API features that allow software engineers to deliver rich user interfaces. This API is hypermedia, so many server-side rendered applications already expose a RESTful API in the form of HTML for the purpose of displaying rich UIs to the user. HTML doesn’t have to be the only hypermedia, and web browsers don’t have to be the only hypermedia clients.

So, why are single-page applications undesirable? It comes down to cohesion, a software property that attempts to define how well a system is organised and optimised for a backlog. Cohesive systems have one location to express a domain idea. For example, if we must validate a user’s input, there is one location to express the idea of that validation. Many teams these days find that delivering a feature, such as a validation rule, requires code changes and deployments (in lockstep) across 2 or more code repositories. Often it starts with the React frontend codebase dealing with presentation and the backend dealing with the business logic. This already requires 2 changes across 2 different codebases to deliver the feature, which is not ideal. Over time, we want more reactive validation, so some of the business logic begins to be duplicated in the frontend. This state of affairs is hardly cohesive and is not conducive to delivering at pace.
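As a small sketch of “one location to express a domain idea”, the server-rendered form below shows its errors by calling the same function that enforces the rule. The field name, the 12-character limit and the messages are hypothetical.

```python
from html import escape

ERROR_MESSAGES = {
    "missing": "Enter your reference number",
    "too_long": "Reference number must be 12 characters or fewer",
}


def validate_reference(value):
    """The single place where the reference number rule is expressed."""
    if not value.strip():
        return ["missing"]
    if len(value.strip()) > 12:
        return ["too_long"]
    return []


def render_reference_field(value, submitted=False):
    """Render the field as HTML, reusing the same rule to decide which errors to show."""
    errors = validate_reference(value) if submitted else []
    error_html = "".join(
        f'<p class="error">{escape(ERROR_MESSAGES[code])}</p>' for code in errors
    )
    return (
        '<label for="reference">Reference number</label>'
        f"{error_html}"
        f'<input id="reference" name="reference" value="{escape(value)}">'
    )
```

Changing the rule means changing `validate_reference` and nothing else: one commit, one codebase, one deployment.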

What’s wrong with a hypermedia-only approach? The HTML standard has too many limitations. Over the last decade, the industry has evolved an approach to prevent page flashing (the screen going blank after a link has been clicked, before the subsequent page loads). Page flashing feels like a cheap experience to modern-day users of the web. This approach was to build single-page applications (SPAs). In addition, there are challenges in delivering reactive experiences, such as typing in a form input that immediately updates the UI without a page reload. These limitations have led to more advanced frontend tooling – culminating in our current potted ecosystem of more than a dozen SPA frameworks.

Ideally, the HTML standard should be extended and improved. There are features it could have that would vastly reduce the need for JavaScript and, subsequently, the SPA ecosystem. A few tools demonstrate the feasibility of this: HTMX, Hotwire and a few others. These HTML-centric approaches can deliver reactive experiences that are almost entirely server-side, with hypermedia as the engine of the application state, by extending the functionality of the HTML standard. In the public sector, service assessments often ask for progressively enhanced citizen-facing applications that work seamlessly without JavaScript. These HTML-centric technologies make this possible without enormous amounts of engineering effort. Moreover, they are standalone libraries with no dependencies, meaning a team could have one JavaScript dependency and entirely side-step npm vulnerability hell.
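As a hedged sketch of the HTML-centric approach, the functions below render an input carrying htmx attributes and the fragment the server would send back. `hx-post`, `hx-trigger` and `hx-target` are htmx attributes. The `/validate/email` route, the function names and the validation rule are assumptions made for illustration, and the code uses no web framework.

```python
from html import escape


def render_email_field(value=""):
    """Render an input that asks the server to validate it as the user types."""
    return (
        '<label for="email">Email address</label>'
        f'<input id="email" name="email" type="email" value="{escape(value)}" '
        'hx-post="/validate/email" '
        'hx-trigger="keyup changed delay:500ms" '
        'hx-target="#email-error">'
        '<div id="email-error"></div>'
    )


def validate_email_fragment(form):
    """Handle the POST: return only the HTML fragment to swap into #email-error."""
    email = form.get("email", "")
    if "@" not in email:
        return '<p class="error">Enter an email address in the correct format</p>'
    return ""
```

The validation rule stays on the server and the browser only swaps in the returned fragment. Without JavaScript, the same field can still be checked on an ordinary form submission.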

Be deliberate with designing teams

Cohesion and team design are very closely related through what is known as Conway’s law. “Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.” This fact has led to techniques such as the “Inverse Conway Manoeuvre,” which aims to solve software architecture design problems by (often radically) changing the organisation structure.

There are similar smells at the programme scale to those on the team scale. For example, how many teams do you need to involve to deliver an outcome? How often are those teams blocking each other from making progress? In an ideal world, a single team can always deliver outcomes within its remit without depending on any other team. In software terms, all the code that needs to be changed for them to be able to deliver is within their control.

Dependencies between teams add management overhead to an overall programme, requiring more roles between teams to keep them aligned and on track. What is often not recognised is that dependencies between teams add extra complexity to codebases because teams are forced to coordinate along the boundary between the code in their control – often through APIs.

A common pattern is to have a frontend team that delivers functionality in a “frontend” codebase, depending on an API built by a backend team. Both teams depend on each other to deliver outcomes, requiring coordination to deliver anything of meaningful value. A slightly improved organisation design might have frontend and backend specialists within feature-aligned teams. These teams share a codebase, so it isn’t perfect, but it will bring together individuals to focus on goals.

Sharing code between feature-aligned teams is not ideal because it still requires careful coordination between them and often leads to scenarios where teams need to agree on the strategic direction of the code design. This gives you the added overhead of needing additional roles to sit between teams and the overhead of making decisions and resolving conflicts that impact both teams.

Keep things simple

The industry has conditioned engineers to believe that everything must be some sort of JavaScript single-page application backed onto a GraphQL API deployed into a serverless function environment with multiple tiers of indirection, including cloud infrastructure glue. This does not have to be the case. I believe that public money is often “accidentally misused” by engineers believing they are doing the right thing by creating vastly overcomplicated solutions, following guidance they found online, for simple problem spaces.

Most digital services in the public sector can be delivered through building humble monoliths using “boring” server-side technologies (e.g. Flask, Sinatra, Rails, Django), ideally by outcome-aligned digital service (product) teams focussed on user needs. These teams can use tried and tested cloud infrastructure features such as containerisation (Docker) and relational databases for persistence (PostgreSQL) to side-step every team needing to know the complete A-Z of your cloud provider’s catalogue. When more complicated approaches appear to be the industry norm, we can collectively forget that there are more straightforward ways to deliver. To escape the default of mediocrity, we must be disciplined and strive for simplicity.

About the Author

Craig J. Bass

Principal Software Engineer at Made Tech