SRE IS EVOLVING.
By 2027, 75% of enterprises will use SRE practices organisation-wide.
Reliability is the backbone of customer satisfaction.
Digital systems are increasingly complex.
That puts SRE, and SRE skills, in high demand: they are essential for navigating that complexity and automating reliability.
SREhub is here to connect SREs and showcase the strategic importance of Site Reliability Engineering.
Slack Community... coming soon!
As our community grows, we want to provide a space where members can catch up between events and take part in an open exchange of ideas and best practice.
If you'd like to be one of the first to know when the Slack Community is up and running, let us know!
Copyright Cloudsoft 2023
SREhub
BLOGS
22/04/23 - What is SRE?
What is Site Reliability Engineering?
Charlotte Binstead
SREhub Community Manager
Hello world!
Welcome to the first ever blog on SREhub.io! And what better way to kick off the blog than by setting out why we're here?
Site Reliability Engineering (SRE) is a discipline that focuses on maintaining and improving the reliability and availability of large, complex software systems.
SRE combines software engineering principles with operations practices to ensure that these systems are scalable, efficient, and reliable.
SRE teams are responsible for monitoring, troubleshooting, and resolving issues with these systems, and for implementing measures to prevent those issues from occurring in the first place, usually via automation.
Why is SRE becoming more and more important?
Simply put? Technology systems are getting more and more complex, and these complex systems are integrated into our everyday lives. A seemingly small failure within a large and complex technology ecosystem can, if not addressed quickly enough, have a huge blast radius and cause a service-impacting outage.
And, to make matters worse, in these complex systems something, somewhere, is probably failing at any given moment.
Site Reliability Engineering principles, and the SREs who implement them, aren't the antidote to complexity but they help to make digital systems more resilient (and therefore more immune) when issues inevitably occur.
SREs help to combat:
Overall, SREs play a critical role in ensuring that systems are reliable, performant, and available, and as technology continues to evolve, their importance will only continue to grow.
What is SREhub.io?
SREhub.io is a community for SREs and the SRE-curious alike.
It is dedicated to creating a space where SREs and those interested in SRE can come together to exchange ideas and knowledge.
Want to know more about SRE? Join the SREhub community!
About the author
Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.
23/04/23 - What happens at an SRE MeetUp?
What happens at an SRE MeetUp?
Charlotte Binstead
SREhub Community Manager
In March this year, we hosted the very first SRE MeetUp in a pub in Edinburgh, UK.
This was a social event, designed to get people interested in SRE around a table and to spark some discussions. We had a great turn-out, and hope to see these faces become regulars as the community grows!
Selfishly, this SRE social was also an opportunity for me to ask a range of people directly involved in designing, building and operating lots of different digital products in complex digital ecosystems what they wanted out of the SRE MeetUp. Their suggestions will form the backbone of topics we'll address over the coming months.
What kind of topics will be discussed?
At the SRE Social, I asked our attendees three questions:
And we got some great answers!
Why are you interested in SRE?
What SRE challenges do you face?
What topics would you like to see addressed?
Volunteer to speak!
If any of these topics grabs you, or you've got a burning question you want to address, then we'd love to hear your thoughts! If you'd like to volunteer as a speaker please drop an email to hello@srehub.io and let us know what you'd like to talk about!
What's coming up at the SRE MeetUp?
26th April 2023
What's the difference between DevOps and SRE?
May 2023 (tbc)
Virtual event (tbc)
8th June 2023
PlatformCon Watch Party
About the author
Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.
25/04/23 - SREhub at PlatformCon23
Catch SREhub at PlatformCon23!
Charlotte Binstead
SREhub Community Manager
On 8-9 June 2023, we'll be participating in PlatformCon23!
PlatformCon is back for its second year, with two days of online talks and discussions: hundreds of talks from top minds in DevOps, Platform Engineering and Site Reliability Engineering.
We're absolutely delighted to be participating, taking to the (virtual) stage to talk about creating a culture of digital immunity.
5 speaker tracks, hundreds of ideas
There are 5 speaker tracks at PlatformCon23. Our talk, Creating a Culture of Digital Immunity, will feature in the Culture track.
Developer platforms don’t live in a vacuum.
They are built by engineers for other engineers.
This track discusses the cultural aspects of platform engineering, from product management to how it relates to DevOps and SRE.
PlatformCon23
PlatformCon Watch Party - 8th June
Join us for a PlatformCon Watch Party on 8th June, from 6pm.
Once the agenda is available, we'll share it for our members to vote on which talks they'd like to watch.
Click the image below to save your spot!
About the author
Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.
18/05/23 - Digital Immunity & Chaos Engineering
Building digital immunity with chaos engineering
[Figure: the practices of digital immunity. Labels: Auto-Remediation, Chaos Engineering, Toil Reduction, Culture, Site Reliability Engineering, Test Automation, Observability]
Charlotte Binstead
SREhub Community Manager
Chaos engineering is helpful for Site Reliability Engineers (SREs) because it allows them to proactively identify potential failures and weaknesses in their systems. The upshot is more resilient systems, reduced downtime, and improved reliability.
A brief history of Chaos Engineering
Chaos engineering practices are derived from Chaos Theory, which studies how complex and changing systems behave in response to seemingly random events. In complex, distributed systems, a data centre glitch or a missed bug can spiral into a huge and costly outage. Remember a couple of years ago when a customer configuration change took down 85% of Fastly’s network?
But the goal of chaos engineering is not chaos; it is improved reliability.
That is why engineers now run tightly controlled chaos experiments, each built around a hypothesis; controlling chaos simulations in this way helps to collect useful data to improve the system and design future experiments.
This differs from testing: testing seeks to validate expected behaviour, whilst chaos engineering aims to uncover unexpected behaviour, both in similarly controlled environments and in production, where it can reveal real-world issues that might not rear their heads in testing.
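To make the idea of a controlled, hypothesis-driven experiment a little more concrete, here is a minimal Python sketch. It is a toy simulation rather than a real chaos tool: the Replica class, the retry behaviour, the 0.9 threshold and the function names are all assumptions made for illustration. It states a hypothesis, injects one failure, measures the result and always rolls the failure back.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical stand-in for one instance of a service
    name: str
    healthy: bool = True


def success_rate(replicas, retries, requests=1000, seed=0):
    """Simulate requests; each attempt hits a random replica and retries on failure."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    succeeded = 0
    for _ in range(requests):
        for _ in range(retries + 1):
            if rng.choice(replicas).healthy:
                succeeded += 1
                break
    return succeeded / requests


def run_experiment(replicas, retries=2, threshold=0.9):
    """Hypothesis: with one replica down, the success rate stays at or above threshold."""
    baseline = success_rate(replicas, retries)
    victim = replicas[0]
    victim.healthy = False  # inject the controlled failure
    try:
        during = success_rate(replicas, retries)
    finally:
        victim.healthy = True  # always roll the failure back
    return {"baseline": baseline, "during": during, "hypothesis_holds": during >= threshold}


if __name__ == "__main__":
    fleet = [Replica("a"), Replica("b"), Replica("c")]
    print(run_experiment(fleet))
```

Running the same experiment with retries=0 makes the hypothesis fail, which is exactly the kind of weakness an experiment like this is meant to surface before a real outage does.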
"Chaos Engineering shines the light of reliability and resiliency on engineers' assumptions and educated guesses, exposing actual weaknesses before they are a career- or business-ending catastrophe."
Myra Haubrich, Senior SRE, Adobe Experience Platform
PlatformCon23
Why Chaos Engineering helps to build digital immunity.
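Digital Immunity is a set of practices for reliability and resilience.
These practices are auto-remediation, chaos engineering, toil reduction, culture, Site Reliability Engineering, test automation and observability.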
If we compare digital immunity to human immunity, Chaos Engineering could be seen in the same light as a vaccine; it’s about exposing both the digital and human elements of our systems to a controlled threat or failure so we can build up the knowledge and technical requirements to recover from worse situations in the future.
Chaos Engineering experiments can expose where auto-remediation is needed, where human intervention can be automated away and where dependencies cause unexpected outcomes.
About the author
Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.
This blog first appeared on www.cloudsoft.io on 15th May 2023.
23/05/23 - Reducing toil with automation
Reducing toil with automation
Linda King
Chief Go-To-Market Officer
Cloudsoft
Toil: work that is manual, repetitive, automatable and reactive and that lacks enduring value
SRE teams will have many processes and workflows that can and should be automated, but where should you start?
From automation to autonomous systems
SRE best practice advocates the reduction of toil by using innovative tools and technologies to automate repetitive or error-prone tasks.
About the author
Linda King is Chief Go-To-Market Officer at Cloudsoft. She has had a successful 20+ year career in technology and has held senior positions in marketing, business strategy and product development. She spends a lot of time thinking and writing about resilience, reliability and automation.
This blog first appeared on www.cloudsoft.io on 22nd March 2023.
05/06/23 - Automating drift detection & remediation
Automating drift detection & remediation
Charlotte Binstead
SREhub Community Manager
Imagine the scenario. You and your team have been working hard to get a new product into production, testing it thoroughly and making sure it’s secure, reliable and performant.
Your application passes from test to staging with flying colours but, as soon as you push to production, the errors start flooding in.
Why? Drift. *shakes fist*
There are several reasons why test and production environments can drift apart:
And, as technology environments become more and more complex, the chances of one or all of these things happening, and therefore of drift occurring, increase sharply.
The further along the road to production drift occurs, the higher the impact. Couple this with CI/CD pipelines deploying new code several times a day and remediating drift becomes an urgent priority.
Auto-remediation for drift.
Environment-as-code enables auto-remediation for drift. Expressing everything as code (not just infrastructure, but runbooks, policies, governance and more) helps you to automate environment maintenance and be confident that your test and prod environments are continuously consistent.
You can detect when drift has occurred and put a policy in place that automatically applies the required configuration changes to the relevant environment.
The benefit of auto-detecting and remediating drift is that it helps you to avoid production errors caused by drift, and helps to ensure your product is as reliable as your tests proved it to be.
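As a rough illustration of detect-then-remediate, the Python sketch below compares a declared configuration with what an environment actually reports, then applies a simple "declared state wins" policy. The setting names, the dictionary representation and both function names are assumptions made for this example; a real environment-as-code tool would work against live resources rather than in-memory dictionaries.

```python
_MISSING = object()  # marks a setting that is absent on one side


def detect_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for every setting that differs."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, _MISSING)
        have = actual.get(key, _MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


def remediate(desired, actual):
    """Policy: bring the environment back to the declared state; return what changed."""
    drift = detect_drift(desired, actual)
    for key, (want, _have) in drift.items():
        if want is _MISSING:
            del actual[key]  # setting exists only in the environment, so remove it
        else:
            actual[key] = want
    return drift


if __name__ == "__main__":
    declared = {"tls": "1.3", "replicas": 3}                 # hypothetical declared state
    prod = {"tls": "1.2", "replicas": 3, "debug": True}      # hypothetical drifted environment
    changed = remediate(declared, prod)
    print(sorted(changed), prod == declared)
```

The policy here is deliberately blunt. In practice you might prefer to alert on some kinds of drift and auto-remediate only the ones you are confident about.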
Can’t I just use Terraform?
Terraform detects drift, right? Right! But only for the resources it manages - and only when you ask it to.
You need to go beyond this, to auto-detection and remediation in all your resources, not just the ones managed in Terraform. Environment-as-Code actually makes Terraform better!
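For the resources Terraform does manage, "asking it" can itself be automated. The Python sketch below wraps terraform plan with the -detailed-exitcode flag, which exits with 0 when there is nothing to change, 2 when the plan contains differences and 1 on error. The function names, the status strings and the injectable run parameter are assumptions made for this example. Note that exit code 2 means the configuration and the real resources differ, which covers drift but also any configuration changes you have not applied yet.

```python
import subprocess


def interpret_plan_exit_code(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status string."""
    if code == 0:
        return "in_sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "changes_pending"  # plan succeeded and found differences
    raise RuntimeError(f"terraform plan failed with exit code {code}")


def check_drift(workdir, run=subprocess.run):
    """Run a plan in workdir and report whether Terraform sees any differences."""
    result = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Calling check_drift on a schedule, for example from a CI job, gives you regular detection for Terraform-managed resources. It still says nothing about resources outside Terraform, which is the gap described above.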
Why automating drift detection and remediation is important for SRE.
About the author
Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.
This blog first appeared on www.cloudsoft.io on 27th March 2023.