Kévin Maschtaler

Developer at marmelab

Errors Budget

An SRE Principle

Errors Budget

Deployment

Reliability

Monitoring

Benjamin Treynor Sloss, VP of Engineering, Google

Get To Know What Really Matters For Your Users

Weather API

Availability

Database

Data Consistency

Stock Exchange App.

Response Time

Define A Service Level Indicator

Availability = \frac{Uptime}{Uptime + Downtime}
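
As a minimal sketch, the availability indicator above can be computed from measured uptime and downtime. The function name, the 30-day window, and the 43 minutes of downtime are illustrative assumptions, not figures from the talk.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a fraction between 0 and 1.

    Implements Availability = Uptime / (Uptime + Downtime).
    """
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("the observation window must be positive")
    return uptime_seconds / total


# Hypothetical measurement: a 30-day window with 43 minutes of downtime.
window_seconds = 30 * 24 * 3600
downtime_seconds = 43 * 60
sli = availability(window_seconds - downtime_seconds, downtime_seconds)
print(f"Availability SLI: {sli:.4%}")
```

Any unit works as long as uptime and downtime share it, since the ratio is dimensionless.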

Choose A Service Level Objective

Unrealistic & unreachable

Do more harm than good

99% ("two nines"): 3.65 days of downtime per year

99.9% ("three nines"): 8.77 hours of downtime per year

99.99% ("four nines"): 52.60 minutes of downtime per year

99.999% ("five nines"): 5.26 minutes of downtime per year
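
The downtime figures above follow from multiplying the unavailable fraction (1 - SLO) by the length of a year. The sketch below assumes a year of 365.25 days, which matches the rounded values listed; the constant and function names are hypothetical.

```python
# Assumption: a year of 365.25 days, which reproduces the figures in the list.
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_hours(slo: float) -> float:
    """Return the yearly downtime, in hours, tolerated by an availability SLO."""
    if not 0 <= slo <= 1:
        raise ValueError("the SLO must be a fraction between 0 and 1")
    return (1 - slo) * HOURS_PER_YEAR


for slo in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowed_downtime_hours(slo)
    print(f"{slo:.3%}: {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.2f} minutes")
```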

Focus on unplanned downtime

Errors Budget = SLI - SLO
if (budget > 0 && !friday)
if (budget <= 0 || friday)
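
A minimal sketch of how the budget formula and the two conditions above could drive a release decision. It assumes the first condition (budget left, not a Friday) permits risky changes and the second one calls for a focus on stability; the function names and sample values are hypothetical.

```python
from datetime import date


def error_budget(sli: float, slo: float) -> float:
    """Errors Budget = SLI - SLO (positive while the objective is exceeded)."""
    return sli - slo


def can_ship_risky_change(budget: float, today: date) -> bool:
    """Mirror `budget > 0 && !friday`; the opposite case is `budget <= 0 || friday`."""
    friday = today.weekday() == 4  # Monday is 0, so Friday is 4
    return budget > 0 and not friday


# Hypothetical values: a measured SLI of 99.95% against an SLO of 99.9%.
budget = error_budget(sli=0.9995, slo=0.999)
if can_ship_risky_change(budget, date.today()):
    print("Budget left: focus on velocity")
else:
    print("Budget spent, or it is Friday: focus on stability")
```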

Recap

1. Get To Know What Really Matters For Your Users

2. Measure it (SLI)

3. Choose A Realistic Objective (SLO)

4. Align Team Behavior With The Errors Budget

5. Iterate and goto 1

- Risky

- Not Risky

Focus On Stability

Focus On Velocity

DEMO

About Site Reliability Engineering

  • leboncoin.fr
  • vente-privee.com
  • Airbnb
  • Amazon
  • Apple
  • Baidu
  • Dropbox
  • Etsy
  • Facebook
  • GitHub
  • LinkedIn
  • Netflix
  • Pinterest
  • Twitter
  • Uber
  • Yahoo!
  • Yelp
  • ...

Links