00:00 - 00:02 | Dimitri is on-call tonight |
00:02 - 00:04 | And he's drunk |
00:05 - 00:06 | Let me explain |
00:07 - 00:08 | Wait |
00:08 - 00:11 | That's not very responsible |
00:12 - 00:13 | It is when your systems are reliable |
00:13 - 00:16 | and you do agile post-mortems |
00:16 - 00:18 | What are agile post-mortems? |
00:18 - 00:21 | They're a leaner post-mortem process |
00:21 - 00:25 | not perceived as a bureaucratic burden |
00:25 - 00:27 | The goal is to understand how an accident could have happened |
00:28 - 00:29 | There is no root cause in complex socio-technical systems |
00:30 - 00:33 | So we just pick the single most impactful change we can make |
00:34 - 00:37 | that will make our system more resilient, and we get it done |
00:37 - 00:38 | The DB master is down |
00:39 - 00:41 | The DB master is down! |
00:41 - 00:44 | DB master? It fails over automatically |
00:44 - 00:46 | The service is up and running |