The book Patterns for Performance and Operability by Ford et al. is one of the few publications that directly address the operability of business software (which is partly why I am writing Software Operability: A Guide for Software Teams). Patterns for Performance and Operability ('PPO') is an excellent volume, containing many valuable insights into the ways we can improve the operability of software systems; this blog post explores a few of the key themes and ideas found in the book.
The performance and operability of software are often given scant attention by software development teams, whose delivery of end-user features is driven by a product owner motivated largely by functional requirements ("Is the feature present?" rather than "Does the feature work well with 500 users?"). Performance and the so-called 'ilities' are often labelled 'non-functional requirements' and tend to be de-scoped during delivery; this inevitably leads to problems in Production, including lost revenue, re-deployments, hasty bug-fixes, and much gnashing of teeth by operations people and development teams alike.
By treating functional and non-functional requirements alike, and adjusting the terminology to 'end-user features' and 'operational features', those aspects which help to make the software really work well can be worked on alongside the regular 'user stories' or visible functionality. By articulating the nature of the operational aspects of software systems, the authors of PPO have helped to emphasise the need to take seriously the operability of software.
Themes
Performance
Much of the book is taken up with identifying and elaborating on different aspects of software system performance. Performance is a crucial aspect of operability; in fact, 'many of the operability tests that you will need to conduct can only be executed under load' (p. 171). Accordingly, capacity planning is outlined (p. 245); although flexible compute models (aka 'the cloud') reduce some of the complexity of capacity planning, they do not remove the need for it altogether, and PPO provides a useful introduction to capacity planning for on-premise, traditional off-premise, and cloud-hosting models.
Types of Testing
The different kinds of operational testing are explained, some of which will probably be an eye-opener to many software development teams. Identifying and testing specific boundary conditions is an important part of operational testing (p. 183). Soak testing, where non-punishing tests are run for a long time in order to flush out errors which occur only after the system is 'soaked' or carrying a lot of transactions, is a good example of the kind of testing which often gets missed before going live. For instance, what happens to the software when its log file eventually fills the entire drive or file system? Such scenarios can be forced or simulated by using a small RAM disk during testing, in order to trigger the error condition quickly, without having to wait for the larger production-sized file system to fill up (which might normally take days).
Failures
The way in which an organisation treats 'failures' can have a marked effect on the effectiveness of the software delivery effort. If every failure in Production leads to ever-increasing additional checks, tests, and (most destructively) blame, then future failures end up being more (not less) likely, as people retreat into the 'safety' of minimal effort and fear of change.
The PPO authors rightly urge us to treat failures as 'canaries in a coal mine', alerting us to bigger problems. W. Edwards Deming advised us to avoid a blame culture based on fear of failure, and so to set up our delivery processes and practices so that we treat failures as an opportunity for learning, not for retribution and blame. The REAL failure is not allowing teams to learn from incidents; the blameless post-mortem review is a crucial part of helping that organisational learning to take place (p. 272).
Monitoring
The DevOps movement puts a great deal of emphasis on monitoring; events such as Monitorama demonstrate the huge amount of interest and competence in the web-scale monitoring space. PPO covers some useful patterns for monitoring, including the importance of aggregating/grouping errors by type/message (p. 243) and the need for 'end-user' or synthetic monitoring (pp. 239-40):
'End-user monitors may not tell you what is wrong, but they are unlikely to fail to alert when your system is experiencing problems.'
The relationship between monitoring and trending (including capacity planning) is covered too.
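
To make the idea of grouping errors by type and message a little more concrete, here is a minimal Python sketch. It is my own illustration, not code from PPO: the log format, the sample lines, and the names `aggregate_errors` and `normalise` are all assumptions made for the example. It collapses messages that differ only in numeric details (request IDs, timings) so that the noisiest error types surface first.

```python
import re
from collections import Counter

# Hypothetical log lines; the format is an assumption for illustration only.
SAMPLE_LOG = [
    "10:00:01 ERROR TimeoutError: request 4812 timed out after 30s",
    "10:00:02 ERROR TimeoutError: request 4813 timed out after 30s",
    "10:00:05 ERROR KeyError: missing field 'email' for user 77",
    "10:00:09 INFO request 4814 completed",
    "10:00:11 ERROR TimeoutError: request 4820 timed out after 45s",
]

# Matches "ERROR <Type>: <message>" anywhere in a line (assumed format).
ERROR_LINE = re.compile(r"ERROR (?P<type>\w+): (?P<message>.*)$")


def normalise(message):
    """Replace runs of digits so messages differing only in IDs group together."""
    return re.sub(r"\d+", "<n>", message)


def aggregate_errors(lines):
    """Count error lines, keyed by (error type, normalised message)."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            key = (match.group("type"), normalise(match.group("message")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Print the most frequent error groups first.
    for (error_type, message), count in aggregate_errors(SAMPLE_LOG).most_common():
        print(f"{count:>3}  {error_type}: {message}")
```

With the sample lines above, the three timeouts collapse into a single group and the lone `KeyError` forms a second one. A real system would more likely do this grouping inside a log aggregation tool than in a script, but the principle is the same.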