That right there is one common concern that binds modern software teams across different service categories and rapidly evolving architectures. For someone tasked with leading an organization into the future, it is easy to overlook routine issues like capacity constraints, defects and regressions, and unpredictable workloads. And when you add 'scale' to the mix, getting reliability right can feel like a herculean task.
Difficult? Yes. Impossible? Absolutely not.
As an organization that works with close to 40 digital agencies based in the UK and the USA, we've had our fair share of lessons while managing a team of over 100 front-end developers handling deployments, change requests and communications. The result is a set of processes that helps us handle scale and complexity while also delivering top-notch reliability.
How can your organization ensure this seamless coming together of scale, complexity and reliability? You can start by collating answers to these five vital questions.
How robust are your deploys and rollbacks?
The frequency of your deploys is immaterial here. What matters is that your deploy and rollback processes are robust and swift. Furthermore, can any member of your team perform them? Put this question to your managers and team leaders to get a fair idea of your battle-readiness.
And if the answer is 'no', whether because of the quality (inadequate) or quantity (overbearing) of your work, it's time to take a step back and invest more in new resources, upskilling and tooling, where required.
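One way to make rollbacks something any team member can do is to keep release history next to the deploy step itself. The sketch below is purely illustrative, with a hypothetical `ReleaseManager` whose `_activate` method stands in for whatever actually ships your artifact (kubectl, Terraform, a PaaS API):

```python
from dataclasses import dataclass, field

@dataclass
class ReleaseManager:
    """Tracks deployed versions so a rollback is one call, not a scramble."""
    history: list = field(default_factory=list)

    def deploy(self, version: str) -> str:
        self.history.append(version)
        return self._activate(version)

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no previous release to roll back to")
        self.history.pop()              # drop the bad release
        return self._activate(self.history[-1])

    def _activate(self, version: str) -> str:
        # Placeholder: real code would push the artifact and verify health.
        return f"serving {version}"

mgr = ReleaseManager()
mgr.deploy("v1.4.0")
mgr.deploy("v1.5.0")
print(mgr.rollback())  # back to v1.4.0 in one step
```

The point is not the ten lines of Python but the property they encode: rolling back requires no tribal knowledge, only the recorded history.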
Are you reliably catching regressions before full production rollout?
In other words, is your team equipped to catch regressions before customers ever face them?
Your team's effectiveness at speedily resolving incidents is a direct function of your preparedness. For instance, the pre-production environment should have tools in place to catch configuration errors, major defects and vital performance regressions before new code or code changes are deployed. In practice, however, it's not always possible to identify every issue at the pre-production stage, especially once scale comes into the picture.
So how does one limit the customer impact of any defects that make it to production? Modern software teams get around this hurdle by employing canaries or feature flags in their deploy processes, which also make it easier to respond quickly if anything goes wrong; partial rollbacks tend to be faster and safer than full redeploys.
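A common way to implement the canary/feature-flag idea is deterministic percentage bucketing: hash the flag name and user ID so each user lands in a stable bucket, then compare against the rollout percentage. This is a minimal sketch (the function name and ramp schedule are illustrative, not from any particular flag library):

```python
import hashlib

def canary_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing flag+user keeps each user's experience stable across requests,
    so a bad release only ever touches the same small cohort.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# A typical ramp: 1% canary -> watch dashboards -> 10% -> 50% -> 100%.
# "Rolling back" is just setting rollout_pct to 0 -- no redeploy needed.
```

This is why partial rollbacks are faster: turning a flag off is a config change, not a deploy.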
Is your team ready to implement the technical or process-level changes that’d be required for these types of deploys and rollbacks to work? More importantly, are you ready to incentivize the adoption of these systems in your organization by investing in the required tools and techniques?
Do you have a risk matrix in place? If so, has it been updated to match your organization’s goals and challenges?
A risk matrix helps you identify and prioritize different risks by assessing the likelihood of their occurrence and the severity of impact if they do occur. Maintaining an updated risk matrix also helps identify where you need alerting and incident runbooks in place.
And because your systems are ever-evolving, you need to revisit the risk matrices you've created regularly and ensure they reflect your organization's current processes. Once a year is a good baseline; ideally, though, the risk matrix should be updated whenever you add a new service, technology or method.
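The standard risk-matrix mechanics are simple enough to sketch: rate each risk's likelihood and impact on a small scale, multiply the two, and let the score drive where alerting and runbooks come first. The entries and the alerting threshold below are made-up examples, not a template:

```python
# Likelihood and impact on 1-5 scales; score = likelihood * impact.
RISKS = [
    # (risk, likelihood, impact) -- illustrative entries only
    ("database runs out of disk", 3, 5),
    ("CDN config typo reaches prod", 2, 4),
    ("single engineer holds deploy knowledge", 4, 3),
]

def prioritize(risks):
    """Sort risks by likelihood x impact, highest first."""
    return sorted(risks, key=lambda r: r[1] * r[2], reverse=True)

for name, likelihood, impact in prioritize(RISKS):
    score = likelihood * impact
    # Example policy: scores >= 12 get alerting and a runbook first.
    print(f"{score:>2}  {name}")
```

Revisiting the matrix after each new service or technology then means re-scoring, not rebuilding.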
How much free capacity does each of your service tiers have?
Capacity constraints are one of the major causes of service disruptions. Operating without adequate free capacity makes your systems and processes more vulnerable to workload and latency changes, or even small performance variations.
And while acknowledging the importance of free capacity matters, it's even more vital to quantify it. So how much free capacity does your company have across its different tiers? At eLuminous, we generally maintain a free capacity margin of 30 per cent, or the equivalent of 90 days of workload growth, for every service tier we operate. If you are unsure about the accuracy of your capacity measurements, err on the side of caution and estimate your free capacity safety margin conservatively.
Pro tip – For better accuracy, base your free capacity measurements on the hotspots in your system, not the averages.
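Put together, the two ideas above (a ~30% margin, measured at the hotspot) reduce to a couple of lines of arithmetic. The function names are hypothetical; the 30% target is the figure quoted in this section:

```python
def free_capacity_pct(peak_utilization: float) -> float:
    """Free capacity measured against the hotspot (peak), not the average."""
    return round((1.0 - peak_utilization) * 100, 1)

def meets_margin(peak_utilization: float, target_pct: float = 30.0) -> bool:
    return free_capacity_pct(peak_utilization) >= target_pct

# A tier that averages 40% utilization but peaks at 75% has only
# 25% real headroom -- below a 30% target, despite a healthy average.
print(free_capacity_pct(0.75))   # 25.0
print(meets_margin(0.75))        # False
```

The average-versus-peak gap in the example is exactly why measuring from hotspots matters.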
Can your systems achieve scale without requiring significant architectural changes for the next 12 months?
In modern architectures, the most damaging reliability risks arise at scaling inflection points. Coming out of an inflection point unscathed requires significant work and collaboration across all your teams to address the core issue(s).
This means sitting down with your team and charting a course for the next 12 months. If you don't think you can get through the year without architectural or procedural changes, make sure all stakeholders know those changes are coming and plan to deliver the necessary work before it's too late.
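Charting that 12-month course is easier with a rough runway number: given current load, a capacity ceiling and a compounding growth rate, how many months until you hit the inflection point? A minimal sketch, assuming steady exponential growth (the figures are invented for illustration):

```python
import math

def months_of_runway(current_load: float, capacity: float,
                     monthly_growth: float) -> float:
    """Months until compounding growth consumes current capacity."""
    if current_load >= capacity:
        return 0.0
    # Solve current_load * (1 + g)^m = capacity for m.
    return math.log(capacity / current_load) / math.log(1 + monthly_growth)

# Serving 600 req/s against a 1000 req/s ceiling, growing 5% per month:
runway = months_of_runway(600, 1000, 0.05)
print(f"{runway:.1f} months")  # roughly 10.5 -- changes needed this year
```

If the number comes out under 12, that is the signal to start socializing the architectural work with stakeholders now.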
So, how did you do?
If you've answered 'no' to any of the above questions, don't be alarmed. That said, don't simply shrug and move on either. The goalposts shift constantly when it comes to reliability, but with sustained procedural and systemic improvements you can keep pace with ever-changing reliability concerns and ensure your company never falls behind. It would be impractical to expect your team to resolve all the above questions in a single sitting, or even over an entire quarter; small and timely improvements, however, can add up to massive results.