FILE: materi-27.module

Site Reliability Engineering (SRE)

Google's approach to running production systems with high reliability.

RUNTIME: 28m LEVEL: advanced STATUS: ACTIVE

1. What Is SRE

Site Reliability Engineering. Created at Google. Treats operations as a software engineering problem. Automate everything that is done manually. Bridges dev and ops with engineering rigor.

2. SLI, SLO, SLA

SLI (Indicator): a metric of service quality. SLO (Objective): the target value for an SLI. SLA (Agreement): a contract with the customer, with consequences if it is breached. Example: an uptime SLO of 99.9 percent (three nines).
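To make the SLI and SLO relationship concrete, here is a minimal sketch that computes an availability SLI from request counts and checks it against a 99.9 percent SLO. The names RequestWindow, availability_sli, and meets_slo, and the request counts, are hypothetical and chosen only for illustration. Real services usually pull these numbers from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window (hypothetical shape)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """Return the availability SLI as a percentage of successful requests."""
    if window.total == 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 100.0
    return 100.0 * window.successful / window.total


def meets_slo(sli_percent: float, slo_percent: float = 99.9) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


# Illustrative numbers only: 800 failed requests out of one million.
window = RequestWindow(total=1_000_000, successful=999_200)
sli = availability_sli(window)
print(f"SLI = {sli:.2f}%, meets SLO: {meets_slo(sli)}")
```

The SLA is not modelled here, because it is a contract rather than a measurement. In practice the SLO is typically set stricter than the SLA so the team has room to react before the contract is breached.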

3. Error Budget

A key concept. If the SLO is 99.9 percent, the error budget is the remaining 0.1 percent. Over a 30-day window that is about 43 minutes of downtime. Deploy aggressively while within budget. Once the budget is spent, freeze deploys and focus on reliability.
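The budget arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a standard implementation: it assumes a time-based budget over a 30-day window, and the function names and downtime figures are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the window, given an uptime SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def deploys_allowed(slo_percent: float, downtime_minutes: float,
                    window_days: int = 30) -> bool:
    """Freeze deploys once the budget is used up (remaining <= 0)."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0


budget = error_budget_minutes(99.9)
print(f"Budget for a 30-day window: {budget:.1f} minutes")

# Hypothetical downtime so far in the window.
for downtime in (10.0, 50.0):
    state = "deploy" if deploys_allowed(99.9, downtime) else "freeze"
    print(f"downtime={downtime:.0f} min -> {state}")
```

A request-based budget (allowed failed requests instead of allowed minutes) follows the same pattern, with the total request count in place of total minutes.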

4. Toil Reduction

Toil: manual, repetitive operational work. SREs should spend less than 50 percent of their time on toil. The rest goes to engineering improvements. Push automation hard, with no compromise.
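One way to track the 50 percent rule is to tag logged work as toil or engineering and compute the toil share. The sketch below assumes a simple weekly time log. The WorkItem type, the task names, and the hours are all hypothetical.

```python
from typing import List, NamedTuple

TOIL_LIMIT = 0.5  # Toil should stay below 50 percent of working time.


class WorkItem(NamedTuple):
    """One logged task (hypothetical shape for illustration)."""
    name: str
    hours: float
    is_toil: bool


def toil_fraction(items: List[WorkItem]) -> float:
    """Return the share of logged hours spent on toil, from 0.0 to 1.0."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    toil = sum(item.hours for item in items if item.is_toil)
    return toil / total


def within_toil_limit(items: List[WorkItem]) -> bool:
    """True when toil is strictly below the 50 percent limit."""
    return toil_fraction(items) < TOIL_LIMIT


# Illustrative week: 16 of 40 hours are toil.
week = [
    WorkItem("manual certificate rotation", 6, True),
    WorkItem("restarting stuck jobs", 10, True),
    WorkItem("writing rollout automation", 16, False),
    WorkItem("capacity planning", 8, False),
]
print(f"Toil share: {toil_fraction(week):.0%}, within limit: {within_toil_limit(week)}")
```

Sorting the toil items by hours gives a rough priority list of what to automate first.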

5. Postmortem Culture

Every incident gets a postmortem. Blameless. Focus on root cause, not on blaming individuals. Concrete action items. Share lessons learned. Build a knowledge base for the team.

Practical Mission

Read the Google SRE book (free online)
Define SLOs for a simple app
Identify toil in your team
Implement an error budget
Write a postmortem for a hypothetical incident

Recap Module

SRE: ops as software engineering
SLI is the metric, SLO the target, SLA the contract
Error budgets enable risk taking
Toil kept below 50 percent
Blameless postmortem culture