When Good Architecture Goes Bad
Mark Dalgarno, mark @ software-acumen.com
Software Acumen, www.software-acumen.com
Every developer eventually encounters it at some stage in his or her career – the code that no one understands and that no one wants to touch in case it breaks. Sound familiar?
But how did the software get that bad? Presumably no one set out to make it like that? The answer is that the software is suffering from Software Erosion – the constant decay of the internal structure of a software system that occurs in all phases of software development and maintenance.
At the architectural level, Software Erosion is seen in the divergence of the software architecture as-implemented from the software architecture as-intended. Note that when talking about the architecture as-intended I’m not speaking here about the initial planned architecture of the software system. Software architectures should evolve over time – this is to be expected as new requirements emerge – so the intended architecture is what your current conception of the architecture is. With software erosion what we’re talking about are unintended modifications or temporary violations of the software architecture.
The problem with software erosion is that its effects accumulate over time to result in a significant decrease in the ability of a system to meet its stakeholder requirements. Unless you take steps to actively pinpoint and stop software erosion it will gradually creep up on you and make further changes to the software significantly harder and less predictable. In the worst case it could lead to the cancellation of the project or, for particularly significant projects, the closure of the business.
Types of Software Erosion
To begin to tackle software erosion you need an understanding of how it typically shows itself. Common types of software erosion include:
- Architectural Rule violations e.g. where strict layering between subsystems is bypassed.
- Cyclic dependencies – for example A calls B calls C calls D calls A. This type of dependency can be valid but when it’s unintended can lead to very complex, opaque code that is hard to understand and hard to test in isolation.
- Dead code – code that once supported part of the software, is now no longer used, but is still cluttering the code base.
- Code clones – identical or near-identical code fragments scattered across the system. A bug fix or change in one clone instance is likely to have to be propagated to the other clone instances.
- Metric outliers e.g. very deep class hierarchies, huge packages, very complex code etc.
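The cyclic dependency case in the list above (A calls B calls C calls D calls A) can be found mechanically once the dependencies are available as a graph. The following is a minimal sketch, assuming the dependencies have already been extracted into a dictionary; the function name find_cycle and the sample graph are illustrative and are not taken from any particular tool.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    colour = {node: WHITE for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for target in deps.get(node, []):
            state = colour.get(target, WHITE)
            if state == GREY:
                # target is already on the current path, so a cycle has been closed
                return path[path.index(target):] + [target]
            if state == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(deps):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical call graph: "A": ["B"] means A calls B.
calls = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
print(find_cycle(calls))  # ['A', 'B', 'C', 'D', 'A']
```

The sketch reports only the first cycle it meets. Whether a reported cycle is intended or a sign of erosion remains a judgement for whoever owns the architecture.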
A well-known example of software erosion was highlighted in a reverse-engineering experiment on two separate versions of ANT some years ago. ANT V1.4.1 (11 October 2001) and ANT V1.6.1 (12 February 2004) were reverse-engineered and the results were compared.
At the time ANT was built in three layers; from the top down these were taskdefs, ant and utils. In the earlier version these layers were well separated and the ant layer was monolithic but small. In the later version the ant layer was still monolithic but had now become very large – making it harder to understand and work with. More problematically, a new upward dependency from the lower-level ant layer to the top-level taskdefs layer had been introduced.
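An upward dependency of this kind is easy to check for once each layer has a position in the intended ordering. The sketch below assumes layer-level dependencies are available as (source, target) pairs; the function name and the sample pairs are hypothetical and do not reproduce the data from the ANT experiment. It also assumes relaxed layering, in which a layer may depend on any layer beneath it.

```python
from typing import Dict, Iterable, List, Tuple

# Layers from top to bottom, as described for ANT in the text.
LAYERS = ["taskdefs", "ant", "utils"]


def upward_dependencies(
    layers: List[str], deps: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return every (source, target) pair where a lower layer depends on a higher one."""
    rank: Dict[str, int] = {name: i for i, name in enumerate(layers)}  # 0 = top layer
    return [(src, dst) for src, dst in deps if rank[src] > rank[dst]]


# Hypothetical layer-level dependencies; only the last pair mirrors the
# upward dependency that the text reports for the later ANT version.
observed = [
    ("taskdefs", "ant"),
    ("ant", "utils"),
    ("taskdefs", "utils"),
    ("ant", "taskdefs"),
]
print(upward_dependencies(LAYERS, observed))  # [('ant', 'taskdefs')]
```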
These types of erosion problems lead to code that is hard to understand, hard to modify and hard to test. But how do you know whether you’re suffering from software erosion?
Are you suffering from Software Erosion?
Perhaps the first thing to observe is that most projects will suffer from software erosion at some stage unless there is a conscious effort to pinpoint and stop such erosion. Even projects that are relatively short-lived can suffer from it. One example I have heard about involved a software project that had to be scrapped after only 6 months because it had already eroded badly.
There are some common things you can look out for when deciding how badly your software is suffering from software erosion:
- The time, effort and risk in implementing new functionality increase – productivity and quality decrease and complexity increases. These are very common side effects when software erosion is present.
- No one has responsibility for the architecture and knowledge of the architecture is held by a decreasing number of people.
- No one on the team can tell you (or agree on) what the intended or implemented architectures are. If you don’t have an understanding of either of these then it’s very likely that software erosion has occurred and will continue to occur.
- The team hasn’t had a stable core membership throughout the software’s life. If someone leaves the team then that person’s knowledge of the architecture and software leaves with him or her. New people take time to get up to speed on the project, so mistakes are made and the software erodes further. If new people are unlucky enough to be introduced into a team where no one knows what the architecture is or should be, then the software will erode even faster as they make changes to it.
- Little or no refactoring is sanctioned. Refactoring is the way to roll back software erosion once it has been pinpointed. Refactorings that remove architecture violations, eliminate code cycles, prune dead code, consolidate code clones and do away with metric outliers are particularly beneficial because, by fighting software erosion, they clear the way for other refactorings, for bug fixes and for new features in the software.
- There’s pressure to rewrite the software. When software has eroded badly it becomes really hard for developers to work with that software. Every change and bug fix takes significantly longer in practice than it should in theory. The code becomes brittle and so even the simplest change can have unexpected knock-on effects which lead to costly rework. I’ll say more on rewriting later.
At a detailed level, software erosion results in problems such as code living in the wrong place, layering violations (as seen above in the ANT example), complex cycles, insufficient decomposition, big packages etc.
Costs of Software Erosion
It can be hard to measure the cost of software erosion and convey this cost to non-technical people who often have to sanction work to stop software erosion. Even though software erosion causes reduced productivity, reduced quality and increased time-to-market, no one specific point of erosion causes these effects in isolation; rather, it is the effect of multiple points of erosion that combine and reinforce each other to cause them.
However, a study by the US Air Force Software Technology Support Centre (STSC) attempted to put some rough measure on the costs of software erosion. The researchers took two versions of a mature software system (50k LOC) and asked two different teams to perform the same maintenance task (adding approx. 3k lines of code) on their respective version. Version 1 was an existing system suffering software erosion. Version 2 was the same system but with the architecture restructured to remove erosion.
The results were staggeringly different. Team 1, working on Version 1, needed over twice as long as Team 2 to complete this relatively short task. Furthermore, Team 1’s results contained more than eight times as many errors as the work submitted by Team 2, working on Version 2. Erosion in a small system such as this still had the potential to lead to significant problems when the software was maintained.
Causes of Erosion
By now you should have some clues as to how software erosion comes about. It does not arise purely spontaneously. Software Erosion comes about through change.
Pressure for change comes from a variety of sources. The need to add new features to a product to help persuade people to buy it, changes to the environment within which the software is deployed (e.g. to support different networking or GUI standards) and technical changes (such as the desire to adopt new coding standards) all have an impact on the software. Where the initial vision for the software doesn’t allow for change, such erosion effects will be seen very quickly.
Software Erosion is also known as software decay or code rot and by similar terms. However, these don’t adequately capture the notion that it is forces external to the software that are ultimately the cause of problems within the software. Erosion is not something that just happens to the code without someone actively making such changes. This is why I feel that the notion of software erosion more adequately describes this gradual wearing down of the ability to work effectively with the software.
The needs of the business can also contribute to software erosion. Even though deliberately eroding your software causes bigger problems down the line, it may be in the best interest of the business to do this for some short-term gain. The problems build up quickly, however, if the business does this repeatedly without spending time to refactor the eroded code. Every developer is familiar with the ‘quick-fix’ that becomes a permanent feature.
Real-World Examples of Software Erosion
How bad is this problem in practice? In 2007-08 I decided to investigate this question by running a number of workshops at different software events in the UK and by engaging in some discussions with some software practitioners further afield.
At every workshop I ran participants spoke about many different examples of systems suffering software erosion:
- Software with a large number of cyclic dependencies that ended up as brittle spaghetti code.
- Systems where business logic (with associated SQL) was captured in the software’s presentation layer – making it hard to replace this layer.
- A software system where the threading architecture eroded so badly over time that the system became unmaintainable and had to be scrapped.
- A single class used as a dumping ground for everything that didn’t have a better home.
- A ‘cancerous wart’ of a software system with ever increasing coupling between modules, packages etc.
- Lots of code clones (copies and near-identical copies).
- Uncontrolled code use – programmers grabbing code, classes and even variables from other parts of the software without any control on what could and couldn’t be used – once again led to significant erosion.
- Several examples of drive-by programming – team-membership constantly changing, programmers not understanding the architecture and so making mistakes when they coded and then moving onto their next project. There was also one example of drive-by-architecting with similar consequences.
- Problems with obsolete software and hardware technology; a lack of skills in these obsolete technologies leading to further decay.
- Sales-driven evolution – where there was no clear roadmap or scope for the software system and so the implemented software architecture inevitably diverged rapidly from the intended architecture.
- Merged companies with different cultures and different principles having to collaborate on a software system leading to decay.
In every workshop all but a few people either were working on projects that had eroded quite badly or had worked on such projects in the past.
Case Study - Outsourcing of a 1MLOC C/C++ system
I outline below a real-world case study in order to get you thinking further about software erosion. My recommendation is to spend 10-15 minutes (either on your own or with a colleague who is also reading this article) thinking about the questions before proceeding to the discussion.
Case Study Project History
A company developed a software system over a number of years. Six years ago the software was transferred to a company-owned outsourcing centre in India where it has been developed since that time. At the time of the transfer the organisation believed that the architecture of the system as intended was well documented and matched what was implemented.
The software is critical and cannot be thrown away easily.
Over time more staff were added to the project to maintain a steady flow of new features. The company has a similar product that is maintained and evolved by 5 developers whereas the Indian department now has 50 developers.
The company recently compared the amount of work done by these two teams and assessed that they delivered roughly the same amount of work.
Acting on this difference in productivity the company compared the architecture from 6 years ago (when the outsourcing took place) against the architecture of the current code and found that many parts of the system have dependencies that are not intended.
The intended architecture was documented, so in theory all involved personnel could have compared actual to as-intended architecture. The initial architecture was probably appropriate for the current system (so it's a good architecture that has gone bad).
The company now intends to bring part of the software back under control in Germany while leaving part under control in India.
Think about whether it is credible that software erosion led to this significant decrease in productivity. What do you think of the company's proposed solution?
The software has been developed over a number of years; the team and their development processes, tools and technologies may have changed during this time. Given this, we can reason that the software was probably modified a lot before it was handed over, and so conclude that the architecture may already have eroded by the time of the handover.
There was a major personnel change 6 years ago when the project was handed over. The two different organisations will have different cultures, knowledge & skills. It is not clear that these differences will be lessened just because both organisations are part of the same multi-national. This could lead to further erosion.
We also have to consider the reasons for the switch and the way the switch took place. Did the organisation cut costs on the project when the software was handed over? Was there a backlog of work on the project that it was felt the new team could tackle sooner or better? How was the handover done? Did they redeploy the existing team elsewhere or did they fire them? Were people from the old team made available to help people from the new team get up to speed? How much time was the new team given to learn about the software before having to start modifying it? If there was no effective handover and insufficient time allowed for the new team to learn the architecture and the code base then erosion is more likely to have occurred.
We’re told that the ‘Software is critical and cannot be thrown away’. We’re also told that there’s been a steady flow of new features. Both of these indicate that changes have taken place and will continue to take place, implying that erosion could be present. This is confirmed by the assessment that there are a lot of unintended dependencies in the architecture as-implemented.
My belief is that it is credible that architectural decay contributed to the team’s problems but that it cannot be untangled from other issues.
- There have been lots of changes over the years.
- At the time of handover it wasn’t clear how closely the architecture as is matched the architecture as intended.
- There were lots of staff changes – how well was the handover managed? – this was initially a comprehension task that needed management and technical support.
Stopping Software Erosion
Stopping software erosion requires management commitment. If managers are only interested in the short-term viability of their software projects then it is hard for developers to get the time and make the effort to tackle the problem. This does not excuse developers from doing what they can to fight erosion but will inevitably make their struggle less effective.
If management commitment is present then the following outline pattern can be used to stop software erosion. How you implement the pattern depends on what tools you have available, what domain your project lies in, how mature the erosion problem is etc.
Stopping Software Erosion – a Pattern
- Start out with a sustainable architecture. All successful software systems evolve; make sure you have built in flexibility for future known changes. Assess your architecture using the most likely change scenarios – where is it flexible, where will it need to evolve? There are always tradeoffs here in the amount of time you can spend in architecture assessment and also in the ‘finished’ architecture.
- When implementation starts, regularly visualize the architecture as the software changes. Get a feel for how close your implemented architecture is to your architectural vision – maybe you need to change the latter.
- Compare the architecture as-implemented to your architecture as-intended to see how they differ. With automated support this can be done as part of the software build. This step does rely on you at least having some vision of what your intended architecture is. If you don’t have this then you can gradually reverse engineer it from your architecture as-implemented. There are now many tools from very basic free ones through to very advanced commercial tools that can help with architecture visualization and checking.
- Use cycle detection, clone detection, metrics analysis and dead code detection to pinpoint software erosion. Again there are several free and commercial tools that tackle some of these tasks.
- Refactor the software to remove eroded code.
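The comparison step in the pattern above can be sketched in a few lines. This is an illustration only, assuming a Python code base in which each top-level package is one architectural unit and the intended architecture is written down as a set of allowed (from, to) package pairs; the package names, sources and function names below are hypothetical.

```python
import ast
from typing import Dict, Set, Tuple


def imported_packages(source: str) -> Set[str]:
    """Collect the top-level package names imported by a piece of Python source."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def violations(
    sources: Dict[str, str], intended: Set[Tuple[str, str]]
) -> Set[Tuple[str, str]]:
    """Return package dependencies present in the code but absent from the intended set."""
    packages = set(sources)
    implemented = {
        (pkg, dep)
        for pkg, code in sources.items()
        for dep in imported_packages(code)
        if dep in packages and dep != pkg  # ignore third-party and self imports
    }
    return implemented - intended


# Hypothetical three-package system; names and sources are illustrative only.
sources = {
    "ui": "import service\n",
    "service": "from storage import records\n",
    "storage": "import ui.widgets\n",  # unintended: storage reaches back up to ui
}
intended = {("ui", "service"), ("service", "storage")}
print(violations(sources, intended))  # {('storage', 'ui')}
```

Run as part of the build, a check like this could fail the build whenever the set of violations is not empty, so that a new unintended dependency is noticed when it is introduced rather than years later.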
Stopping Software Erosion – Cultural Factors
As noted above, if top management doesn’t support the fight against software erosion then developers have their work cut out to stop erosion. With management support you can create a culture where stopping erosion is valued. This culture is likely to have characteristics such as an emphasis on regular refactoring, clear assignment of responsibilities, sharing of architectural knowledge and work, and frequent communication across the whole group.
In ‘Designing Maintainability in Software Engineering: a Quantified Approach’, Tom Gilb describes one team’s ‘Green Week’ – one week set aside each month to focus on improving their software’s maintainability. This proved more successful for the team than their earlier one-day-a-week approach and had the added benefit of making the development team feel empowered.
A few words on rewriting
Before I wrap up I’d like to say a few words about software rewrites. As I noted earlier, pressure from development teams to rewrite software commonly manifests itself when that software has eroded. In the worst case the development team uses the excuse of a possible future rewrite to delay refactoring work to the software. When this occurs, the software continues to erode until it reaches a state where working with it becomes very difficult. Even if a rewrite could once have been avoided had action been taken, the result is that a rewrite becomes inevitable due to the negligence of the team.
As a developer, when faced with a decision about rewriting some software you should always ask yourself whether you are planning to rewrite it for the right reasons. Is it because you cannot make the software maintainable or is it to get rid of code you haven’t tried hard enough to refactor or code that someone other than you has written? Worse still, is it just to get some hot new technology onto your CV?
As a manager, ask yourself whether you can afford a rewrite. Do you have the right people with the right skills available for the right length of time? Do you understand the risks of new tools and technologies? Do you understand what you have to build? Are you rewriting the software or building something brand new? Worse still, how long will it be before your competitors catch up? In the Doomsday scenario, can your organisation handle the total failure of the rewriting project?
If you’re about to risk an expensive and lengthy rewrite of your software, are you really sure that you’ve exhausted every approach to fighting software erosion in your current code base?
Any successful software system is likely to evolve. Unless preventative work is undertaken the software will erode. As the software erodes the cost and risk of further development rises. It’s rarely too early to start fighting software erosion. The costs of software erosion start to bite very quickly once it sets in.
There are lots of different things that can be done to stop software erosion – you (just) need to work out what the best value approach is for your particular project. If you are a manager then create a culture where fighting software erosion is encouraged and supported. If you don't do this then no one will care about erosion. If you are an architect or developer then educate yourself about the different causes of erosion and the different approaches for fighting it. If you’re interested in finding out more, or sharing your ideas on stopping software erosion, then please get in touch.
See http://www.stsc.hill.af.mil/crosstalk/2005/11/0511SangalWaldman.html for more information on the Ant case study and http://codefeed.com/blog/?p=98 for a brief early Ant project history.
General Background Reading:
Lehman's laws of software evolution: M M Lehman, J F Ramil, P D Wernick, D E Perry, W M Turski, "Metrics and Laws of Software Evolution: The Nineties View," Fourth International Software Metrics Symposium (METRICS'97), p. 20, 1997
Refactoring in Large Software Projects: Performing Complex Restructurings Successfully, Martin Lippert, Stefan Roock, Wiley 2006