DevOps is a relatively new function in Australia and many organisations are aspiring to an elite level of service.
Recently, DevOps Research and Assessment (DORA) released the 2019 Accelerate State of DevOps Report.
The report found that the highest-performing DevOps teams were 24 times more likely than low performers to execute on all five essential characteristics of cloud computing defined by the National Institute of Standards and Technology (NIST): on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service.
The report also found evidence that enterprise organisations (those with more than 5,000 employees) are lower performers than those with fewer than 5,000 employees.
Heavyweight process and controls, as well as tightly coupled architectures, are among the causes of slower delivery and the associated instability.
Some enterprises utilise rigid systems and processes as a result of prior incidents and experiences, whereas elite performers are nimbler and more adaptable.
The effects of rigid processes can be likened to post-traumatic stress disorder in humans and manifest as organisational trauma in businesses.
In describing the effects of trauma on people, Somatic Experiencing Trauma Institute director Dr Peter Levine wrote: “Animals in the wild are not traumatised by routine threats in their lives while humans, on the other hand, are readily overwhelmed and often subject to the traumatic symptoms of hyper-arousal, shutdown and dysregulation.”
A zebra, for example, may choose a fight, flight or freeze response when chased by a lion but, should it be lucky enough to survive, will get up, shake itself off and resume its regular ‘rest and digest’ state of being.
People are not zebras, and our pre-frontal cortex gives us the ability to replay and recall the physiological responses we felt during a traumatic incident.
For humans, the inability to break out of a fight, flight or freeze response can result in post-traumatic stress disorder (PTSD).
In the same way that people suffer PTSD, a traumatic event within an organisation, such as an outage or severe incident, can trigger a company-wide response that leaves the business paralysed.
In my role as DevOps evangelist at PagerDuty, I see firsthand what happens when humans are faced with a traumatic experience.
Our brains kick in with survival mechanisms.
These mechanisms are the familiar fight or flight response, but can also include the freeze response, which occurs when we are terrified or feel that there is no chance of escape.
The concept of fight, flight, and freeze applies to organisations.
Once an organisation has experienced trauma (such as a large outage), the “memory” of that trauma leads to a dysregulated state whenever it is activated by similar indicators, such as system alerts or customer issues.
From my own experience with post-traumatic stress (PTS), I know it is crucial to be able to identify organisational trauma and heal it.
If a human response fits within the normal window of tolerance, we can process and get through the trauma.
However, if not, we can find ourselves “stuck on” in fight or flight mode or “stuck off” in freeze mode.
People with PTS can experience extended bouts of hyper-arousal, resulting in panic and anxiety, or of hypo-arousal, leaving them depressed and lethargic.
Similarly, organisational trauma can leave a business paralysed and unable to move forward.
A hyper-aroused organisation displays the effects of constant vigilance: it is hyper-aware of threats, which takes energy away from moving forward.
A hypo-aroused organisation can be frozen, also unable to move forward.
In humans, one treatment for PTSD is Eye Movement Desensitisation and Reprocessing (EMDR), which helps people to heal from the symptoms and emotional distress of disturbing life experiences.
In successful EMDR therapy, the meaning of painful events is transformed on an emotional level and can result in sufferers feeling empowered by the very experiences that once traumatised them.
Organisations, and particularly DevOps teams, can similarly overcome trauma through ‘Game Days’, which involve practising outages and incidents in a non-stressful environment.
Rather than exercising under pressure, teams can create an outage and resolve it in a game-day scenario, creating an organisational mindset that dealing with these scenarios is part of their day-to-day role, and not to be feared.
By normalising critical incidents and outages through repeated practice, organisations can build digital operational maturity, which means they become more effective at real-time work.
They can then focus on performance metrics, which improve as the organisation becomes more adept at responding to incidents and outages.
Some companies, such as PagerDuty, undertake “Failure Fridays” where they deliberately inject failure into a service and go through the steps to resolve it.
It normalises the incident response process so that it just becomes a matter of course.
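As a rough illustration of what deliberately injecting failure can look like, here is a minimal Python sketch. It is not PagerDuty's actual Failure Friday tooling. The names `FailureInjector`, `InjectedFailure`, `lookup_order` and `run_drill` are all hypothetical, and the sketch assumes the service being exercised can be wrapped as an ordinary function call.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real response while a drill is running."""


class FailureInjector:
    """Wraps a callable and makes a fraction of calls fail during a drill."""

    def __init__(self, func, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.func = func
        self.failure_rate = failure_rate
        self.enabled = False  # injection is off unless a drill switches it on
        self.injected = 0     # number of failures injected so far
        self._rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self.enabled and self._rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFailure("drill: simulated outage")
        return self.func(*args, **kwargs)


def lookup_order(order_id):
    # Hypothetical stand-in for a real service call.
    return {"order_id": order_id, "status": "shipped"}


def run_drill(service, order_ids):
    """Switch injection on, send requests, and always switch it off again."""
    failures = 0
    service.enabled = True
    try:
        for order_id in order_ids:
            try:
                service(order_id)
            except InjectedFailure:
                failures += 1  # in a real drill, responders would be paged here
    finally:
        service.enabled = False
    return failures


if __name__ == "__main__":
    service = FailureInjector(lookup_order, failure_rate=0.3, seed=42)
    print("failures seen during drill:", run_drill(service, range(20)))
```

The injection is off by default and is switched off again in a `finally` block, so the simulated outage stays confined to the drill window. That containment is what makes the exercise safe enough to repeat.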
Many organisations are also not using post-mortems (following incidents and outages) to best effect.
The incident response process isn't complete just because the incident is resolved.
A critical component of digital operations includes a blameless post-mortem process that allows teams to identify contributing factors, patterns and insights that can help prevent similar incidents from recurring.
All stakeholders must be given the opportunity to contribute, and making the process ‘blameless’ lets affected teams use the report to learn and grow.
Ironically, it seems the larger the organisation, the less likely teams are to share the results of their post-mortems.
This process failure denies those who experienced a critical issue the scope to process it. Humans are storytellers by nature, and giving teams the opportunity to share and tell the tale empowers them to process and heal.
Releasing the binds of organisational trauma through pressure-free incident response training, such as Game Days or Failure Fridays, changes a team's mindset from dreading that 3am call about a critical incident to perceiving incidents as regular day-to-day occurrences that they have been trained and equipped to deal with as part of their standard operations.
It can enable companies to move forward and continue to focus on driving innovation and improving the customer experience.
Written by Matt Stratton, Digital Evangelist, PagerDuty
Matt has over 20 years' experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms including Apartments.com. He is founder and co-host of the popular Arrested DevOps podcast.