Rob Thatcher is co-author of the book Team Guide to Software Operability, published by Skelton Thatcher Publications, part of Conflux Books. Rob has substantial experience helping organisations to build effective technical operations, support, and delivery teams, and design and operate effective IT architectures. With Director-level experience in the financial services sector, his focus is on building high availability and high performance environments and teams with on-premise deployments, hosted private cloud environments, and public cloud services.
Interview by Jovile Bartkeviciute of Conflux.
Q1. Why is a focus on software operability so important for modern software delivery?
Modern software stacks are still required to run across operating environments that are built, configured and maintained in a variety of ways. For all of the infrastructure-as-code and version-controlled, configuration-managed cloud, hybrid and on-prem environments in existence, we still see significant numbers of environments where a human-managed tech stack is in use. Operability features in these environments are crucial to fault recovery and system manageability, and can help shorten the Mean Time To Recovery (MTTR) of individual components and of systems as a whole. In those environments where modern approaches and techniques bring the benefit of versioned config management and infrastructure-as-code or PaaS, improving the operability of our software can help provide more telemetry, and telemetry of greater value.
Q2. You have a background in IT Operations in financial services. What operability practices were you using back then that work nicely now?
Back in the year 2000, I was working on a financial services platform for a company in London. We were deploying the platform with configuration managed from a central source, and from a single artifact source (although we didn’t call it that at the time; labels change, purpose may not). We had version checks and diagnostics that could run across the estate, which meant the operations teams significantly reduced the instances of missing libraries and objects, or incorrect versions, in the deployment targets. Adopting consistent logging output from the various components in our distributed systems meant we were able to monitor the interconnected state of our trading platforms and shorten the feedback delay to business operations in the event of systems issues.
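As an illustration of the kind of estate-wide version check described in this answer, here is a minimal sketch in Python. It is not the tooling used on that platform: the function names, the manifest layout (component name mapped to version string) and the host names are all hypothetical, and it assumes the installed versions for each host have already been collected by some other means.

```python
from typing import Dict, List

# Hypothetical data shapes: a manifest maps component name -> version string.
Manifest = Dict[str, str]


def find_version_drift(expected: Manifest, installed: Manifest) -> List[str]:
    """List components that are missing or at the wrong version on one host."""
    problems = []
    for name, want in sorted(expected.items()):
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: missing (expected {want})")
        elif have != want:
            problems.append(f"{name}: found {have}, expected {want}")
    return problems


def check_estate(expected: Manifest, estate: Dict[str, Manifest]) -> Dict[str, List[str]]:
    """Run the check for every host and keep only the hosts with problems."""
    report = {}
    for host, installed in estate.items():
        problems = find_version_drift(expected, installed)
        if problems:
            report[host] = problems
    return report


if __name__ == "__main__":
    expected = {"pricing-lib": "2.1.0", "fix-gateway": "1.4.2"}
    estate = {
        "host-a": {"pricing-lib": "2.1.0", "fix-gateway": "1.4.2"},
        "host-b": {"pricing-lib": "2.0.9"},
    }
    for host, problems in check_estate(expected, estate).items():
        for problem in problems:
            print(f"{host}: {problem}")
```

Comparing every host against a single expected manifest mirrors the "single artifact source" idea: the manifest is the one place that says what should be deployed, and the report shows where reality differs.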
Q3. You are a big fan of large dashboards showing system health and metrics. In what way do dashboards help with operability?
I’m a believer in visualisation, yes. The adage that “a picture is worth a thousand words” underpins this belief. Large dashboards with ‘just the right’ mix of information can vastly improve the understanding of systemic health and performance, giving development, operations and business people a window into the ‘tech black boxes’ that power a business model.
Q4. What one (or two, or three) things would you recommend that organisations do to improve the operability of their software systems?
There are so many things to choose from, yet I am still seeing a lack of a consistent approach to logging (both content and location), too little adoption of log aggregation, and underutilisation of healthchecks on distributed systems components (a precursor, for example, to building a distributed system healthcheck dashboard from the queries). These three things (a consistent logging approach, log aggregation, and healthchecks) almost always improve the operability of any system and visibility into its workings.
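Two of these recommendations, consistent logging and healthchecks, can be sketched in a few lines of Python using only the standard library. This is one possible shape rather than a prescribed implementation: the `JsonFormatter` class, the field names, the `run_healthchecks` and `overall_status` helpers and the component names are all hypothetical. Log aggregation itself is not shown; the assumption is that one JSON object per line is a format most aggregation tools can ingest.

```python
import json
import logging
from typing import Callable, Dict


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with the same fields."""

    def __init__(self, component: str) -> None:
        super().__init__()
        self.component = component  # hypothetical per-service label

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "component": self.component,
                "message": record.getMessage(),
            },
            sort_keys=True,
        )


def run_healthchecks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each named check; a False result or an exception counts as failing."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            results[name] = "failing"
    return results


def overall_status(results: Dict[str, str]) -> str:
    """Summarise per-component results into a single dashboard-style status."""
    return "ok" if all(state == "ok" for state in results.values()) else "degraded"


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(component="order-service"))
    log = logging.getLogger("order-service")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # Stand-in checks; real ones would query a database, queue, and so on.
    results = run_healthchecks({"database": lambda: True, "queue": lambda: False})
    log.info("healthcheck %s: %s", overall_status(results), results)
```

If every component emits the same fields and exposes the same kind of check results, a healthcheck dashboard becomes largely a matter of collecting and displaying them.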
Follow Rob Thatcher online: @robtthatcher and https://www.linkedin.com/in/robthatchr/