Welcome to

SREhub.io!

The new community for SREs, and the SRE-curious alike.

SRE IS EVOLVING.

By 2027, 75% of enterprises will use SRE practices organisation-wide.

Reliability is the backbone of customer satisfaction.


Digital systems are increasingly complex.


And this means SRE, and SRE skills, are in high demand because they are essential for navigating that complexity to automate reliability.


SREhub is here to connect SREs and showcase the strategic importance of Site Reliability Engineering.


MeetUp, skill up.

SREhub was born from the Site Reliability Engineering MeetUp, first held in March 2023.


Our in-person events are hosted in Edinburgh (UK), with London and more soon to follow!


Virtual events are also coming soon, so join the MeetUp community to stay up to date.

See MeetUps & events

Hot off the press

What is SRE?

Why is SRE so important, and why is there a community for SREs?


SRE Meetup

The SRE Community

What happens at an SRE MeetUp?

Come along - the more the merrier!


SREhub at PlatformCon23

Tune in on 8th-9th June

Catch us at


Chaos engineering & digital immunity

How does Chaos Engineering help with digital immunity?


Reducing toil with automation

Give your SRE teams their sanity back.


Deploying fast & furious

Automating drift detection & remediation


Read all blogs


Slack Community...

coming soon!

As our community grows, we want to provide a space for catching up between events and where members can participate in an open exchange of ideas and best practice.


If you'd like to be one of the first to know when the Slack Community is up and running, let us know!

Let me know about Slack

Join the SREhub community.

Sign up

Email

hello@srehub.io

Supported by


Copyright Cloudsoft 2023


SREhub

BLOGS


22/04/23 - What is SRE?


What is Site Reliability Engineering?


Charlotte Binstead

SREhub Community Manager

Hello world!


Welcome to the first ever blog on SREhub.io! And what better way to kick off the blog than by setting out why we're here?


Site Reliability Engineering (SRE) is a discipline that focuses on maintaining and improving the reliability and availability of large, complex software systems.


SRE combines software engineering principles with operations practices to ensure that these systems are scalable, efficient, and reliable.


SRE teams are responsible for monitoring, troubleshooting, and resolving issues with these systems, and for implementing measures to prevent those issues from occurring in the first place, usually via automation.


Why is SRE becoming more and more important?


Simply put? Technology systems are getting more and more complex, and these complex systems are integrated into our everyday lives. A seemingly small failure within a large and complex technology ecosystem can, if not addressed quickly enough, have a huge blast radius and cause a service-impacting outage.


And, to make matters worse, in these complex systems something, somewhere, is probably failing at any given moment.

Site Reliability Engineering principles, and the SREs who implement them, aren't the antidote to complexity, but they help to make digital systems more resilient (and therefore more immune) when issues inevitably occur.


SREs help to combat:


  • The cost of downtime: With businesses increasingly relying on technology, even a few minutes of downtime can result in significant financial losses; it's estimated that Southwest Airlines took an $800m hit from their December 2022 outages. SREs help ensure that systems are resilient to failures and highly available, minimizing the risk of downtime.
  • Lack of scale and flexibility: As applications grow and user traffic increases, it becomes more challenging to scale systems while maintaining performance and reliability. SREs help design and implement systems that can scale rapidly and seamlessly, allowing businesses to keep up with demand.
  • Inefficient, risky manual work: Manual processes are time-consuming and error-prone. SREs use automation to streamline tasks such as deployment, monitoring, and incident response, increasing efficiency and reducing the risk of human error.


Overall, SREs play a critical role in ensuring that systems are reliable, performant, and available, and as technology continues to evolve, their importance will only continue to grow.


What is SREhub.io?

SREhub.io is a community for SREs and the SRE-curious alike.


SREhub.io is dedicated to creating a space where SREs and those interested in SRE can come together to exchange ideas and knowledge.


  • Whether you're a seasoned SRE or just starting out, SREhub aims to provide a space to network and learn from others in the field.
  • Members of the community will have access to a wealth of resources, including articles, tutorials, and forums that cover a wide range of SRE-related topics.
  • SREhub also hosts regular events, both online and in person, where members can connect and collaborate with one another. Interested in speaking? Email hello@srehub.io!
  • The community is committed to promoting inclusivity, diversity, and respect for all members, regardless of their background or experience level.
  • By joining SREhub.io, you'll not only have the opportunity to expand your knowledge and skills, but also to contribute to the growth and development of the SRE community as a whole.


Want to know more about SRE? Join the SREhub community!



About the author

Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.


23/04/23 - What happens at an SRE MeetUp?


What happens at an SRE MeetUp?


Charlotte Binstead

SREhub Community Manager


In March this year, we hosted the very first SRE MeetUp in a pub in Edinburgh, UK.


This was a social event, designed to get people interested in SRE around a table and to spark some discussions. We had a great turn-out, and hope to see these faces become regulars as the community grows!

Selfishly, this SRE social was also an opportunity for me to ask a range of people directly involved in designing, building and operating lots of different digital products in complex digital ecosystems what they wanted out of the SRE MeetUp. Their suggestions will form the backbone of topics we'll address over the coming months.

What kind of topics will be discussed?

At the SRE Social, I asked our attendees three questions:


  1. Why are you interested in SRE?
  2. What SRE challenges do you face?
  3. What topics would you like to see addressed?


And we got some great answers!

Why are you interested in SRE?

  • learning opportunity, sharing ideas & improving understanding
  • back-end infrastructure automation in AWS
  • friendly community
  • enhance my knowledge
  • tech challenges
  • knowledge exchange
  • how to be less 'ad-hoc' about failures
  • adding a £ value to downtime
  • automation & observability
  • automation: SREs focus more on automating repetitive manual tasks, which helps everyone work more efficiently
  • problem solving.

What SRE challenges do you face?

  • tech debt
  • less toil please
  • cloud native landscape
  • response
  • recognition - what're the differences between DevOps, SRE, Platform Engineering
  • culture & team structure
  • getting buy-in
  • structured methodology for reliability
  • OS patch management
  • rushed software development leading to technical debt, poor code and poor automation.
  • collaborating with other teams
  • maintaining knowledge, keeping documentation & runbooks up to date.

What topics would you like to see addressed?

  • How to scale SRE?
  • automating observability
  • it's not just production!
  • HashiCorp tool suite
  • Culture shift required and different ways of thinking
  • Should it be "Site Reliability Engineering?"
  • Edge/end user device management.

Volunteer to speak!

If any of these topics grabs you, or you've got a burning question you want to address, then we'd love to hear your thoughts! If you'd like to volunteer as a speaker please drop an email to hello@srehub.io and let us know what you'd like to talk about!

What's coming up at the SRE MeetUp?


SRE Meetup

The SRE Community

26th April 2023

What's the difference between DevOps and SRE?

May 2023 (tbc)

Virtual event (tbc)

8th June 2023

PlatformCon Watch Party

See MeetUps & events

About the author

Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.


25/04/23 - SREhub at PlatformCon23


Catch SREhub at PlatformCon23!


Charlotte Binstead

SREhub Community Manager

On 08 - 09 June 2023, we'll be participating in PlatformCon23!

PlatformCon is back for its second year, with two days of online discussions and hundreds of talks from top minds in DevOps, Platform Engineering and Site Reliability Engineering.

We're absolutely delighted to be participating, taking to the (virtual) stage to talk about creating a culture of digital immunity.

5 speaker tracks, hundreds of ideas

There are 5 speaker tracks at PlatformCon23. Our talk, Creating a Culture of Digital Immunity, will feature in the Culture track.


  • Stories
  • Tech
  • Blueprints
  • Culture
  • Impact

Developer platforms don’t live in a vacuum.

They are built by engineers for other engineers.


This track discusses the cultural aspects of platform engineering, from product management to how it relates to DevOps and SRE.

PlatformCon23

PlatformCon Watch Party - 8th June

Join us for a PlatformCon Watch Party on 8th June, from 6pm.


Once the agenda is available, we'll share it for our members to vote on which talks they'd like to watch.


Click the image below to save your spot!

About the author

Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.


18/05/23 - Digital Immunity & Chaos Engineering


Building digital immunity with chaos engineering

Figure: pie chart with segments labelled auto-remediation, chaos engineering, toil reduction, culture, site reliability engineering, test automation and observability.

Charlotte Binstead

SREhub Community Manager

Chaos engineering is helpful for Site Reliability Engineers (SREs) because it allows them to proactively identify potential failures and weaknesses in their systems. The upshot is more resilient systems, reduced downtime, and improved reliability.

A brief history of Chaos Engineering

Chaos engineering practices are derived from Chaos Theory, which studies how complex and changing systems behave in response to seemingly random events. In complex, distributed systems, a data centre glitch or a missed bug can spiral into a huge and costly outage. Remember a couple of years ago when a customer configuration change took down 85% of Fastly’s network?

But the goal of chaos engineering is not chaos; it is improved reliability.


That is why engineers now run tightly controlled, hypothesis-driven chaos experiments; controlling chaos simulations in this way helps to collect useful data to improve the system and design future experiments.

This differs from testing: tests seek to validate expected behaviour, whilst chaos engineering aims to uncover unexpected behaviour, both in similarly controlled environments and in production, where it can surface real-world issues that might not rear their heads in testing.
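To make the idea of a controlled, hypothesis-led experiment concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: `ToyService`, its replica names and the 99% threshold are assumptions, not part of any real chaos tooling. It states a hypothesis, measures a steady state, injects one failure, observes the result and always rolls the failure back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: requests rotate across replicas."""
    replicas: dict = field(default_factory=lambda: {"a": True, "b": True, "c": True})
    retry_on_failure: bool = True

    def handle(self, request_id: int) -> bool:
        names = sorted(self.replicas)
        first = request_id % len(names)
        # With retries, try each replica in turn; without, only the first pick.
        attempts = len(names) if self.retry_on_failure else 1
        for offset in range(attempts):
            if self.replicas[names[(first + offset) % len(names)]]:
                return True
        return False


def success_rate(service: ToyService, requests: int = 300) -> float:
    """Fraction of simulated requests that the service handles successfully."""
    return sum(service.handle(i) for i in range(requests)) / requests


def run_experiment(service: ToyService, victim: str, threshold: float = 0.99) -> dict:
    """One controlled experiment: steady state, inject, observe, roll back."""
    baseline = success_rate(service)
    service.replicas[victim] = False      # inject the failure
    try:
        observed = success_rate(service)
    finally:
        service.replicas[victim] = True   # always restore the replica
    return {
        "hypothesis": f"success rate stays >= {threshold:.0%} when replica {victim!r} fails",
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(), victim="b"))
    print(run_experiment(ToyService(retry_on_failure=False), victim="b"))
```

In this toy, the service with retries absorbs the lost replica, while the one without retries drops every request routed to it. That is the kind of weakness a real experiment is designed to surface before an outage does.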



Chaos Engineering shines the light of reliability and resiliency on engineers' assumptions and educated guesses, exposing actual weaknesses before they are a career- or business-ending catastrophe.

Myra Haubrich, Senior SRE, Adobe Experience Platform

Why Chaos Engineering helps to build digital immunity.


Digital Immunity is a set of practices for reliability and resilience.


These practices are:


  • auto-remediation
  • chaos engineering
  • site reliability engineering
  • observability
  • test automation
  • toil reduction

If we compare digital immunity to human immunity, Chaos Engineering could be seen in the same light as a vaccine; it’s about exposing both the digital and human elements of our systems to a controlled threat or failure so we can build up the knowledge and technical requirements to recover from worse situations in the future.


Chaos Engineering experiments can expose where auto-remediation is needed, where human intervention can be automated away and where dependencies cause unexpected outcomes.

About the author

Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.

This blog first appeared on www.cloudsoft.io on 15th May 2023.


23/05/23 - Reducing toil with automation


Reducing toil with automation


Linda King

Chief Go-To-Market Officer

Cloudsoft

Teams working in Site Reliability Engineering (SRE), Platform and Operations often find themselves heavily involved in tasks that are manual and highly repetitive. This type of work is known as toil.

Toil: work that is manual, repetitive, automatable and reactive and that lacks enduring value

Examples of toil are:


For many teams, toil can be all-consuming - with some spending 90% of their time, or more, on toil. This means they’ll struggle to find the time to improve productivity and accuracy, let alone scale site reliability engineering operations by automating these often highly automatable tasks.


Toil, therefore, reduces the impact and value of SRE. Toil does not mean that a task is not important or should not take place. It means that it is non-value-adding engineering work.


The human impact of toil

Teams with high levels of toil are typically not happy teams as toil can have a serious detrimental impact on team morale.


Teams with a high toil-to-engineering work ratio often lack job satisfaction and tend to be less successful as a result. High levels of toil typically also lead to higher-than-normal staff turnover in these teams.


Reducing toil

Ultimately automation is the key to reducing toil. Whilst it can be tempting for organisations to aim for zero toil, this realistically isn’t practical or even possible to achieve.


The goal should be to rightsize toil to a manageable level, and what this is will depend on the individual organisation's size and growth rate. Industry analyst Gartner recommends, for example, that no more than 50% of a site reliability engineer’s time be spent on toil.
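As a rough illustration of what rightsizing could look like in practice, the Python sketch below works out a toil ratio from a week of logged hours and compares it with the 50% guideline mentioned above. The task names, the hours and the helper names (`toil_ratio`, `is_rightsized`) are all hypothetical; which tasks count as toil is a judgement each team has to make for itself.

```python
TOIL_CEILING = 0.50  # the 50% guideline quoted above; adjust to your own target


def toil_ratio(hours_by_task: dict, toil_tasks: set) -> float:
    """Share of logged hours spent on tasks the team has classified as toil."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    toil = sum(hours for task, hours in hours_by_task.items() if task in toil_tasks)
    return toil / total


def is_rightsized(ratio: float, ceiling: float = TOIL_CEILING) -> bool:
    """True when the toil share is at or below the chosen ceiling."""
    return ratio <= ceiling


if __name__ == "__main__":
    # Hypothetical week for one engineer (hours).
    week = {
        "manual deploys": 12,
        "ticket triage": 10,
        "OS patching": 6,
        "automation project": 8,
        "design review": 4,
    }
    toil = {"manual deploys", "ticket triage", "OS patching"}
    ratio = toil_ratio(week, toil)
    print(f"toil ratio: {ratio:.0%}, rightsized: {is_rightsized(ratio)}")
```

Tracking the ratio over several weeks, rather than a single snapshot, is likely to give a fairer picture of whether automation work is paying off.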


Benefits of reducing toil


Where do I start?

Whilst SRE teams will have many processes and workflows that can and should be automated, where should you start?

  1. Identify high-impact use cases
  2. Prioritise the automation efforts that will deliver the greatest benefit by identifying the biggest constraint in the workloads
  3. Then identify the next biggest benefit and constraint and so on.
  4. Look at undertaking a proof of value around one of your high-impact use cases with a tool like Cloudsoft AMP that will reduce your toil, fast.
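The ordering described in the steps above can be sketched in a few lines of Python. This is one possible reading, not a prescribed method: it assumes the "biggest constraint" is the task that consumes the most toil hours per month, and breaks ties by how quickly the automation effort pays for itself. The `Candidate` fields and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical automation candidate with rough estimates attached."""
    name: str
    toil_hours_per_month: float
    effort_hours: float  # estimated one-off cost to automate

    @property
    def payback_months(self) -> float:
        """Months until the hours saved cover the effort spent."""
        if self.toil_hours_per_month <= 0:
            return float("inf")
        return self.effort_hours / self.toil_hours_per_month


def prioritise(candidates: list) -> list:
    """Biggest monthly toil first; ties go to the quicker payback."""
    return sorted(
        candidates,
        key=lambda c: (-c.toil_hours_per_month, c.payback_months),
    )


if __name__ == "__main__":
    backlog = [
        Candidate("certificate renewal", toil_hours_per_month=6, effort_hours=12),
        Candidate("manual deploys", toil_hours_per_month=40, effort_hours=80),
        Candidate("OS patching", toil_hours_per_month=40, effort_hours=40),
    ]
    for c in prioritise(backlog):
        print(f"{c.name}: {c.toil_hours_per_month}h/month, payback {c.payback_months:.1f} months")
```

Re-running the ranking after each automation lands mirrors step 3: the next biggest constraint rises to the top of the list.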


From automation to autonomous systems

SRE best practice advocates the reduction of toil by using innovative tools and technologies to automate repetitive or error-prone tasks.


However, as your SRE function matures, the longer-term, high-value goal should be to create not just automated systems but autonomous systems that require minimal human intervention to make decisions.


Doing this moves organisations towards autonomous operations and optimised high-value engineering models.



About the author

Linda King is Chief Go-To-Market Officer at Cloudsoft. She has had a successful 20+ year career in technology and has held senior positions in marketing, business strategy and product development. She spends a lot of time thinking and writing about resilience, reliability and automation.

This blog first appeared on www.cloudsoft.io on 22nd March, 2023.


05/06/23 - Automating drift detection & remediation


Automating drift detection & remediation


Charlotte Binstead

SREhub Community Manager

Imagine the scenario. You and your team have been working hard to get a new product into production, testing it thoroughly and making sure it’s secure, reliable and performant.


Your application passes from test to staging with flying colours but, as soon as you push to production, the errors start flooding in.


Why? Drift. *shakes fist*


There are several reasons why test and production environments can drift apart:


  • Configuration differences: The test environment may have different configurations than the production environment.
  • Version differences: The test environment may have a different version of software or hardware than the production environment.
  • Resource differences: The test environment may have different resources (such as CPU, memory, or storage) than the production environment.
  • Human error: Changes made by humans in one environment may not be replicated in the other environment.
  • Data differences: The test environment may have different data than the production environment.


And, as technology environments become more and more complex, the chances of one or all of these things happening, and therefore of drift occurring, increase sharply.


The further along the road to production drift occurs, the higher the impact. Couple this with CI/CD pipelines deploying new code several times a day and remediating drift becomes an urgent priority.


Auto-remediation for drift.

Environment-as-code enables auto-remediation for drift. By expressing everything as code (not just infrastructure, but runbooks, policies, governance and more), it helps you to automate environment maintenance and be confident that your test and prod environments are continuously consistent.


You can detect when drift has occurred and trigger a policy that automatically applies the required configuration changes to the relevant environment.


The benefit of auto-detecting and remediating drift is that it helps you to avoid production errors caused by drift, and helps to ensure your product is as reliable as your tests proved it to be.
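As a simplified illustration of detect-then-remediate, the Python sketch below treats an environment as a flat dictionary of settings. This is an assumption made purely for the example: real environments are far richer, and the function names (`detect_drift`, `remediate`) and settings are hypothetical. The desired state stands in for what is expressed as code; the actual state stands in for what is observed in the environment.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired_value, actual_value)} for every mismatch.

    A value of None means the setting is absent on that side; this sketch
    assumes None is never used as a real setting value.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def remediate(desired: dict, actual: dict) -> dict:
    """Policy sketch: bring `actual` back in line with `desired`, in place.

    Returns the drift that was found so the change can be logged or reviewed.
    """
    drift = detect_drift(desired, actual)
    for key, (want, _have) in drift.items():
        if want is None:
            actual.pop(key, None)  # not declared in code, so remove it
        else:
            actual[key] = want
    return drift


if __name__ == "__main__":
    # Hypothetical settings: the code-defined state versus what prod really runs.
    as_code = {"app_version": "2.4.1", "memory_mb": 2048, "feature_x": True}
    prod = {"app_version": "2.4.0", "memory_mb": 2048, "debug": True}
    print("drift found:", remediate(as_code, prod))
    print("drift remaining:", detect_drift(as_code, prod))
```

In practice a team might prefer to alert on some kinds of drift rather than overwrite them automatically; that policy decision sits outside this sketch.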


Can’t I just use Terraform?

Terraform detects drift, right? Right! But only for the resources it manages - and only when you ask it to.


You need to go beyond this, to auto-detection and remediation in all your resources, not just the ones managed in Terraform. Environment-as-Code actually makes Terraform better!
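For the "only when you ask it to" part, one way to ask on a schedule is to wrap `terraform plan -detailed-exitcode` in a small script. With that flag, Terraform exits with 0 when the plan is empty, 1 on error and 2 when the plan contains changes. The Python sketch below is a minimal wrapper under stated assumptions: the function names and status labels are hypothetical, the working directory must already be initialised, and a non-empty plan can mean an unapplied code change as well as drift. It still covers only resources Terraform manages.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status label."""
    if code == 0:
        return "in_sync"          # plan succeeded with an empty diff
    if code == 2:
        return "changes_pending"  # plan succeeded with a non-empty diff
    return "error"                # 1, or anything unexpected


def check_terraform(workdir: str, run=subprocess.run) -> str:
    """Run a plan in `workdir` and report whether config and reality differ.

    `run` is injectable so the function can be exercised without Terraform.
    """
    result = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)


if __name__ == "__main__":
    # Assumes the terraform binary is on PATH and "." is an initialised workspace.
    print(check_terraform("."))
```

Running something like this from a scheduler or pipeline gives detection; deciding what to do about a "changes_pending" result, and covering resources outside Terraform, is where the broader auto-remediation approach comes in.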

Why automating drift detection and remediation is important for SRE.


  1. Reliability: SREs are responsible for the reliability of applications - and so ensuring that test, staging and prod are consistent goes a long way to achieving that goal.
  2. Toil reduction: SREs also prioritise automation to reduce toil. Manually identifying and remediating drift in complex environments is time-consuming. For some teams, toil can be all-consuming, so automating drift remediation can significantly reduce toil.
  3. Improved customer experience: whether internal or external customers, ensuring products are released on time and that they work as intended is key to customer experience.


About the author

Charlotte is the SREhub Community Manager and organiser of the SRE MeetUp. Based in Edinburgh, UK, Charlotte's day job is as Head of Growth Marketing at Cloudsoft, where she spends a lot of time thinking and writing about resilience, reliability and automation.

This blog first appeared on www.cloudsoft.io on 27th March 2023.
