What's good software design?
I've been trying to come up with a definition of what a good software design or architecture is (here I'm kind of using the two words interchangeably). What are the characteristics of a good design versus a bad one?
Posted by Martin Zokov
Also worth asking: why do we need a good design at all? The answer, in my opinion, is… you don't, really. I'm sure there's a huge number of software systems that aren't well designed and work just fine. However, I believe there's a direct correlation between the quality of your software design and cost - the cost of adding new features, the cost of fixing bugs, and so on - and these ultimately reflect on the value you bring to your customers. Some example proxy metrics could be churn or customer satisfaction. If you have a good design, you're able to deliver value to customers more quickly.
Plenty of metrics immediately come to mind, like availability and throughput, but metrics don't paint the full picture. I believe the more fundamental characteristics of a good design are qualitative and not that easily measured. I'm afraid I can't provide an easy answer, but there are questions that help move a design in the right direction. Below are the questions and characteristics I believe are crucial to get right if you want a good design.
1. Elegance
And no, I don’t mean your fashion choices. An elegant design is simple and contains just enough complexity to get the job done. Complexity will inevitably increase as a product develops so it’s important to be careful to not add more than is actually needed.
So what exactly is complexity? It's hard to define and quantify, but there are some descriptors we can use. A couple of examples…
How much effort is required to navigate and find the relevant code in the code base? There's a well-known principle of low coupling and high cohesion, or to put it another way: do things that belong together live together? Are packages, modules and repos organized in a fairly obvious and consistent way, so that someone vaguely familiar with the code base could find what they need? Does the code base use conventions and patterns to organize source files? If every other developer introduces a new style for doing things, your code will soon become like a house built with random household objects and duct tape instead of concrete walls. In most cases I'd favour sticking to consistency and popular patterns that are easily understood. Sure, it's good to have team autonomy, but for fundamental decisions a bit of centralised decision making helps a ton. Nobody likes writing documentation anyway, so at least use patterns to help other developers…
How difficult or easy is it to trace the flow of information in a system? If you need to debug something, how long does it take a developer to trace where data comes from and where it goes as it moves through various source files? A lot of popular frameworks hide important details of what happens with a request, which ultimately puts more cognitive load on whoever is investigating.
One of the biggest drivers of complexity is configurability; there's always a trade-off between the two. If you want a system whose behaviour changes dynamically based on variables like country or user type, you're bound to add complexity, and you don't want a ton of if statements (see the sketch below). It's important to have a good understanding of the problem domain and to think about which parts are actually likely to be dynamic.
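As a rough illustration, one common way to avoid scattering if statements is to keep each variant of the behaviour behind a single lookup. This is only a minimal sketch - the pricing rules and country codes here are hypothetical:

```python
# A minimal sketch of configurable behaviour without if-chains.
# The pricing rules and country codes are hypothetical examples.
from typing import Callable

# Each country-specific rule lives in one place instead of being
# scattered through the code as if/else branches.
PRICING_RULES: dict[str, Callable[[float], float]] = {
    "UK": lambda amount: amount * 1.20,  # add 20% VAT
    "US": lambda amount: amount * 1.07,  # add 7% sales tax
}

def price_with_tax(amount: float, country: str) -> float:
    # Unknown countries fall back to the untaxed amount.
    rule = PRICING_RULES.get(country, lambda a: a)
    return rule(amount)

print(price_with_tax(100.0, "UK"))  # 120.0
```

Adding a new country then means adding one entry, not hunting down every branch that mentions it.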
Another question to ask: which deployment platform or runtime environment do you want to support? Do you really need to be able to run on both Google Cloud and AWS? Each cloud platform has its own intricacies and quirks which you'll need to deal with, and this makes it hard to build a software system that can run on both with the click of a button. That translates into additional effort for a dev or devops team to support both clouds, and in turn a larger cost to the business, diverting effort away from driving value for customers.
These types of questions are fundamental and will inform your design for years. Even if you think 'oh, we'll change the architecture later', both you and I know this won't happen and those early design choices stick around. Even if you do reach a point where rearchitecting your stack is warranted, it'll be a huge cost and will take months or even years.
There's no silver bullet in any of this - it's all about trade-offs and about asking some of those questions along the way. The best you can do, I believe, is use established ideas like the SOLID or CUPID design principles and Domain-Driven Design (DDD). An underrated tip here is to have the people defining the requirements work closely with the software devs, so the devs can do what they do best - solve problems and develop an understanding of the product. That way you can aim to build a system that needs as few fundamental changes as possible in the long run. Too much complexity prevents that and makes a system harder to understand and maintain. An elegant design stands the test of time and does not need large fundamental changes very often.
2. Testability
Here's a good question for gauging how easily your system can be tested: how much effort and time does it take to test some arbitrary functionality? Do you have an environment where you can, within minutes (and I'm talking less than 5), deploy a new version of your API or system and try to reproduce a bug, or do you need to create a whole new environment that will take an hour? One is clearly better than the other. If you're working with HTTP-based endpoints, one thing that greatly helps is having a Postman collection of commonly used requests that simulate the happy path and crucial failure scenarios. Alternatively, it could be a suite of test data that's easily accessible and preferably in source control. Also, how easy is it to set up or refresh the test data in your environment? If you're able to get to a clean database in the span of minutes, you're likely in a good spot.
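As an illustration of cheap, repeatable test data - a minimal sketch assuming pytest and an in-memory SQLite database; the schema and seed rows are hypothetical:

```python
# A minimal sketch of refreshing test data per test run, assuming
# pytest and SQLite. The schema and seed data are hypothetical.
import sqlite3

import pytest

@pytest.fixture
def clean_db():
    # An in-memory database gives every test a fresh, known state.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")
    conn.commit()
    yield conn
    conn.close()

def test_user_count(clean_db):
    count = clean_db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert count == 2
```

The point isn't the specific tools; it's that every test starts from a known, fresh state without anyone having to rebuild an environment by hand.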
How many repos do you need to change to try out a new feature? If you consistently have to change multiple codebases in order to see how the system will behave, perhaps those need to be merged. This could be a code/organisational smell that your codebase is not quite as cohesive as it should be.
Going down to the code level, how difficult is it to write integration and unit tests? A layered architecture helps here, because it lets you mock the dependencies between layers - think of the testing pyramid. It should be as straightforward as possible to mock any external dependencies, like databases and external or internal APIs that you need to call. And do you trust your mocks? They should mimic the behaviour of the real dependency as closely as possible, otherwise your tests become useless. Bear in mind, though, that it's advisable not to mock an interface you don't own - if its owners change the behaviour, your test becomes a false idol.
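One common way to honour 'don't mock an interface you don't own' is to wrap the external dependency behind a small interface of your own and mock that instead. A minimal sketch - the gateway and checkout names are hypothetical:

```python
# Sketch: wrap a third-party payment API behind an interface you own,
# then mock that interface in tests. All names here are hypothetical.
from unittest.mock import Mock

class PaymentGateway:
    """Our own thin interface over the external payment provider."""

    def charge(self, user_id: str, amount: float) -> bool:
        raise NotImplementedError

def checkout(gateway: PaymentGateway, user_id: str, amount: float) -> str:
    # Business logic depends only on the interface we own.
    return "ok" if gateway.charge(user_id, amount) else "failed"

def test_checkout_success():
    gateway = Mock(spec=PaymentGateway)  # mock our interface, not theirs
    gateway.charge.return_value = True
    assert checkout(gateway, "user-1", 9.99) == "ok"
    gateway.charge.assert_called_once_with("user-1", 9.99)
```

If the provider changes their API, only the one real implementation of PaymentGateway needs updating; the tests against your own interface stay meaningful.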
3. Observability
As a system grows, it inevitably becomes harder for any single person to understand. So in order to understand what happens when something goes wrong, we need a way of pinpointing the source of an issue. How long would that take a developer? Are the logs providing enough value to the person reading them? It's really hard to walk the line between too much logging and not enough, so choose the logging level for each message carefully. Always consider that the person looking at a log message may not know much about the part of the system that's at fault, so try to guide them by imagining you have no context. How much effort is required to understand which component yielded the error? What about the error itself - was something null, or was an external API down? If so, say so in the log message, along with any identifying information. It's important to provide enough information that someone investigating can understand what's happening as quickly as possible - future-you will thank today-you.
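For example, a log line that names the failing component, the cause, and an identifier for the request saves the reader a lot of guesswork. A minimal sketch using Python's standard logging - the service, function and field names are all hypothetical:

```python
# Sketch: a log message that says what failed, where, and for which
# request. The component and field names are hypothetical.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("billing.invoice-service")

def call_exchange_rate_api() -> float:
    # Stand-in for a real HTTP call; here it always times out.
    raise TimeoutError("no response within 5s")

def get_rate(request_id: str) -> float | None:
    try:
        return call_exchange_rate_api()
    except TimeoutError as exc:
        # Name the component, the cause, and the identifying info.
        logger.error(
            "exchange-rate lookup failed in rate-fetcher (request_id=%s): %s",
            request_id,
            exc,
        )
        return None

get_rate("req-42")
```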
Another point on observability is tracing. In larger systems you'll likely have multiple moving parts, and being able to see how a request moves through the system is crucial for understanding where a fault occurred. How does a request transform as it moves through your system? Are you able to map out a journey of what information was added to or removed from a request as it goes from one end to the other? The most crucial point here is what tooling you use to achieve that. If you need to manually Ctrl+F or grep through 10 different log files spread across 5 servers… that's not really helpful in terms of tracing; developers will have to sift through tons of logs just to find that some config file was not set up correctly. Tools like Kibana exist for a reason - search for a request ID and get the whole lifetime of a request. Productivity increases dramatically.
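The prerequisite for that kind of search is that the request ID actually travels with the request. A minimal sketch of the idea - the handler names and payload shape are hypothetical:

```python
# Sketch: generate a request ID at the system's edge and propagate it,
# so one search in a tool like Kibana finds the whole request lifetime.
# All names here are hypothetical.
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

def handle_request(payload: dict) -> dict:
    # Reuse an incoming ID if present, otherwise mint one at the edge.
    request_id = payload.get("request_id") or str(uuid.uuid4())
    logger.info("request received (request_id=%s)", request_id)
    result = enrich(payload, request_id)  # downstream step keeps the ID
    logger.info("request completed (request_id=%s)", request_id)
    return result

def enrich(payload: dict, request_id: str) -> dict:
    logger.info("enriching payload (request_id=%s)", request_id)
    return {**payload, "request_id": request_id, "enriched": True}

handle_request({"user": "alice"})
```

With every log line and downstream call carrying the same ID, one search in a log aggregator reconstructs the whole journey.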
Final Thoughts
At the end of the day, I don't think anyone starts designing a software system with the aim of producing a bad design… I think one of the most fundamental characteristics is simplicity. If a developer has to seek out 'the guy who built this' whenever they need to understand or maintain a piece of code, that's a red flag. None of the points above are conclusive or definitive answers either, but I believe asking these questions while designing software does improve its quality. Fundamentally, a design that is simple to understand and maintain will be better than one that has too many moving parts.