Software stability anti-patterns
On a stable system the user can get work done. Good stability does not necessarily cost a lot: a stable design usually costs about the same to implement as an unstable one, while poor stability carries significant real costs. If the system falls over and dies every day, nobody cares about anything else; short-term fixes, and short-term thinking, will dominate in that environment. Systems like that often never reach profitability: they exhibit low availability, direct losses in missed revenue, and indirect losses through damage to the brand. It’s therefore important to be aware that design and architecture decisions are also financial decisions.
A tiny programming error is usually what starts the snowball rolling downhill toward a production incident. When that happens, restoring service takes precedence over investigation, and managing perception after a major incident can be as important as managing the incident itself. While hunting for clues, some will be reliable and some will not, because people usually mix observation with speculation. The postmortem can actually be harder to solve than a murder, because “the body” goes away; log files, monitoring data, and thread dumps may help. Once you know where to look, it’s simple to write a test that exposes the bug. Expecting every single bug to be driven out is a fantasy; the key is to keep them from spreading and turning into incidents.
Failure may look improbable if you consider each component’s probability in isolation, but underneath, the system’s layers are coupled and events are not fully independent, so a fault (an incorrect internal state) opens a crack. Faults may then become errors (visibly incorrect behavior), and errors provoke failures (an unresponsive system).
Although no two incidents share exactly the same chain of failures, patterns do emerge that create, accelerate, or multiply cracks in the system.
The main source of cracks is integration with other systems: any integration point can cause a failure in the system, and when your code isn’t defensive enough, failures propagate quickly.
A slow response, for example, is a lot worse than no response: it ties up resources in both the calling system and the called system, and it tends to propagate upward. For websites, slow responses can even generate more traffic, as users start hitting the reload button, creating a self-reinforcing loop. A common cause of slow responses is unbounded result sets, where the caller lets the other system dictate terms. Unless your application explicitly limits the number of results it’s willing to process, it can end up exhausting its memory or spinning in a while loop. This failure mode can occur when querying databases or calling services; even a simple traversal such as “fetch this customer’s orders” can return a huge result set. To avoid this problem, don’t rely on the data producers: put limits into your application-level protocols and use pagination.
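As a minimal sketch of that advice, the JDBC query below caps how many rows it is willing to pull back and fetches a customer’s orders one page at a time. The table and column names, the page size, and the LIMIT/OFFSET dialect are illustrative assumptions, not taken from any particular schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BoundedOrderQuery {

    private static final int PAGE_SIZE = 100; // hard limit on what the caller is willing to process

    // Fetch one page of a customer's orders instead of the whole, potentially huge, result set.
    // Table and column names (orders, customer_id, order_id) are illustrative.
    public List<Long> fetchOrderIds(Connection conn, long customerId, int pageNumber) throws SQLException {
        String sql = "SELECT order_id FROM orders WHERE customer_id = ? "
                   + "ORDER BY order_id LIMIT ? OFFSET ?";          // LIMIT/OFFSET syntax assumed (e.g. PostgreSQL, MySQL)
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            ps.setInt(2, PAGE_SIZE);
            ps.setInt(3, pageNumber * PAGE_SIZE);
            ps.setMaxRows(PAGE_SIZE);                                // belt and braces: the driver also stops at PAGE_SIZE rows
            List<Long> ids = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("order_id"));
                }
            }
            return ids;
        }
    }
}
```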
Debugging integration points sometimes requires peeling back layers of abstraction, and packet capture tools may be the only way to understand what’s really happening on the network. To avert integration point problems, use defensive programming and tests that simulate network failures.
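A small example of that defensive stance, using Java’s standard java.net.http client: both the connection attempt and the whole request get explicit timeouts, so a hung integration point cannot tie up the caller indefinitely. The URL and the timeout values are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class DefensiveHttpCall {

    // Never let a remote integration point dictate how long we wait.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))   // fail fast if we cannot even connect
            .build();

    public static String fetchStatus(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))      // cap the whole request, not just the connection
                .GET()
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 500) {
            // Treat server errors as failures of the integration point, not as data.
            throw new IllegalStateException("Remote service error: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchStatus("https://example.com/"));  // illustrative URL
    }
}
```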
Sometimes “cracks” will jump from one system or layer to another, provoking a cascading failure. For example, a database engine that becomes slow can exhaust the client’s connection pool, and all request-handling threads can end up blocked waiting for a connection. From the users’ perspective, a system that is hung but technically running is no better than one that is down; if they can’t use it, it is a problem.
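To make the pool-exhaustion scenario concrete, here is a toy, hand-rolled resource pool in which a checkout waits only for a bounded time instead of blocking forever; production pools (HikariCP and similar) expose an equivalent checkout timeout, so treat this purely as an illustration of the idea.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// A toy resource pool: callers wait at most `timeout` for a resource, so a slow
// downstream that exhausts the pool produces errors instead of hung threads.
public class BoundedPool<T> {

    private final BlockingQueue<T> available;

    public BoundedPool(List<T> resources) {
        this.available = new ArrayBlockingQueue<>(resources.size(), true, resources);
    }

    public T checkOut(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        T resource = available.poll(timeout, unit);   // bounded wait, never forever
        if (resource == null) {
            // Failing fast turns a silent hang into an error the caller can handle.
            throw new TimeoutException("No resource available within " + timeout + " " + unit);
        }
        return resource;
    }

    public void checkIn(T resource) {
        available.offer(resource);
    }
}
```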
The most effective patterns to stop cracks from jumping the gap are circuit breakers and timeouts, so that no deadlock lasts forever; you should also avoid blocking your request-handling threads so the system scales better. External monitoring and metrics can reveal these kinds of problems, because hangs are very hard to find during development.
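A minimal circuit breaker sketch, not tied to any particular library: after a configurable number of consecutive failures the circuit opens and calls fail fast, and after a cool-down period one trial call is allowed through.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// A minimal circuit breaker: after `failureThreshold` consecutive failures the
// circuit opens and calls fail immediately, giving the downstream system time
// to recover; after `openDuration` one trial call is allowed through again.
public class CircuitBreaker {

    private final int failureThreshold;
    private final Duration openDuration;

    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                throw new IllegalStateException("Circuit open: failing fast");
            }
            openedAt = null;              // half-open: let one trial call through
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures = 0;      // success closes the circuit
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now(); // trip the breaker
            }
            throw e;
        }
    }
}
```

You would wrap each integration-point call in it, for example `breaker.call(() -> fetchStatus(url))` using the hypothetical fetchStatus method from the earlier sketch.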
As your user base grows, traffic will eventually surpass your capacity (the maximum throughput your system can sustain while maintaining acceptable performance). Every additional user usually means more memory for storing sessions (sessions are all about caching data in memory), and when memory gets short, a lot of surprising things can happen. When there is not enough space to allocate an object, an out-of-memory error is raised, and the logging system might not even be able to log it (use external monitoring in addition to the log file). To avoid running out of memory, you can make memory use more frugal automatically, for example with weak references. Another way to deal with per-user memory is to keep it on dedicated servers, such as Memcached or Redis. Of course, there is a trade-off between keeping the data in memory on the same server and the latency of reaching it on a remote one, because local memory is still faster than remote memory. When misused, caching creates new problems and brings a risk of stale data. A cache is a bet that the cost of generating the value once, plus the cost of hashing and lookups, is less than the cost of generating it every time it’s needed. When invalidating, be careful to avoid a database “dogpile”.
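As a sketch of that memory-frugal idea: the text mentions weak references; on the JVM a closely related option is SoftReference, which lets the garbage collector reclaim cached values when memory actually runs short. The class below is a simplified illustration, and its naive miss-then-load path does not by itself prevent a dogpile of concurrent regenerations; a per-key lock or single-flight scheme would be needed for that.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A memory-sensitive cache: values are held through SoftReferences, so the
// garbage collector is free to reclaim them under memory pressure instead of
// letting the cache push the process into an OutOfMemoryError.
public class SoftCache<K, V> {

    private final Map<K, SoftReference<V>> entries = new ConcurrentHashMap<>();

    public V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = entries.get(key);
        V value = (ref != null) ? ref.get() : null;   // null if never cached or already reclaimed
        if (value == null) {
            value = loader.apply(key);                // regenerate on a miss
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public void invalidate(K key) {
        entries.remove(key);
    }
}
```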
It can be hard to see where else scaling effects will bite you. As the number of servers grows, a different communication strategy may be needed, such as UDP broadcast, multicast, pub/sub messaging, or queues. Also beware of unbalanced capacities between layers: one side of a relationship can sometimes scale up much more than the other and overwhelm it. To avoid this, build both callers and providers to be resilient; on the caller you can implement a circuit breaker to relieve the pressure on downstream services. Because development and test environments rarely replicate production sizing, these issues can be difficult to find, so virtualize QA and scale it up: patterns that work fine in small environments may slow down or fail completely at production sizes.
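On the provider side of an unbalanced relationship, one way to stay resilient is to bound the backlog of accepted work and shed the rest, rather than queueing requests without limit. The sketch below uses a fixed-size ThreadPoolExecutor with a bounded queue; the pool and queue sizes are arbitrary examples.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Provider-side defense against an oversized caller: a bounded work queue plus
// an AbortPolicy means excess requests are rejected quickly (and can be retried
// or shed upstream) instead of piling up until memory runs out.
public class BoundedWorkerPool {

    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                8, 8,                                  // fixed number of workers
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),         // bounded backlog
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of queueing forever
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create();
        try {
            pool.execute(() -> System.out.println("handling request"));
        } catch (RejectedExecutionException e) {
            // Surface the overload to the caller (e.g. an HTTP 503) rather than hanging.
            System.err.println("Overloaded, shedding request");
        } finally {
            pool.shutdown();
        }
    }
}
```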
Another well-known anti-pattern is the self-denial attack. A classic example is the email from marketing to a “select group of users” that attracts millions because a valuable offer gets redistributed rapidly. Make sure nobody sends mass emails with deep links; instead, use static “landing zone” pages so the first wave of clicks doesn’t hit shared resources. Autoscaling can help when the traffic surge does arrive, but it may take precious minutes, so “pre-autoscale” before a marketing event goes out (for that to happen, you have to keep the lines of communication open).
Sometimes, if demand is concentrated, you need a higher peak capacity than you would if the surge were spread out. This “dogpile” effect can arise in several situations: cron jobs scheduled at the same time, load tests whose virtual-user scripts have fixed-time waits, or booting all servers after a restart or deploy (startup load can be significantly higher than steady-state load because of connections that need to be re-established, cold caches, and so on). To avoid stampedes caused by cron jobs, stagger them to spread the load out; in load tests, every pause in a script should have a small random delta applied.
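A tiny helper for that last point, adding a random delta to an otherwise fixed pause; the base delay and jitter range are arbitrary examples.

```java
import java.util.concurrent.ThreadLocalRandom;

// Add a random delta to any fixed pause so that many clients (cron jobs,
// virtual users in a load test, servers warming up after a restart) don't all
// hit the shared resource at exactly the same moment.
public class Jitter {

    // Returns the base delay plus up to `maxJitterMillis` of random slack.
    public static long jitteredDelayMillis(long baseMillis, long maxJitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = jitteredDelayMillis(60_000, 30_000);  // "every minute, give or take"
        System.out.println("Sleeping for " + delay + " ms before running the job");
        Thread.sleep(delay);
    }
}
```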
Even after a server goes down, it can still cause trouble and jeopardise the rest of the servers: they pick up the dead one’s burden, their load increases, and that makes them more likely to fail in turn from the same application defect or resource leak. If your servers autoscale, your service will stay available as long as the scaler can react faster than the chain reaction propagates. You can also defend by partitioning servers with bulkheads.
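A sketch of bulkheads at the thread-pool level: each downstream dependency gets its own pool, so a slow or failing dependency can exhaust only its own threads while the rest keep serving. The pool names (payments, catalog) and sizes are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Bulkheads as separate thread pools: a hung payments dependency can tie up
// only the payments pool, leaving the catalog pool free to keep serving.
public class Bulkheads {

    private final ExecutorService paymentsPool = Executors.newFixedThreadPool(10);
    private final ExecutorService catalogPool  = Executors.newFixedThreadPool(10);

    public Future<String> chargeCustomer(Runnable call) {
        return submit(paymentsPool, call, "payment done");
    }

    public Future<String> browseCatalog(Runnable call) {
        return submit(catalogPool, call, "catalog page rendered");
    }

    private Future<String> submit(ExecutorService pool, Runnable call, String result) {
        return pool.submit(() -> {
            call.run();        // the work runs inside its own partition's threads
            return result;
        });
    }
}
```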
Wrapping up…
Faults can never be completely prevented, but we should keep them from becoming errors and failures if possible; denying the inevitability of problems robs you of your power to control and contain them. Once you accept that errors and failures will happen, you have the ability to design your system’s reaction to them.