(originally posted on tech emma)
Nobody likes admitting mistakes. Except this guy.
Getting people to go to a post-mortem meeting is easy. Getting people to participate without a sense of impending doom is hard. Most people don’t want to be there. They show up ready to fight or make excuses, with a pit in their stomach as they wait to talk about what went wrong.
So, how do you fix that pit-in-the-stomach feeling? We’ve worked on this a bit at Emma, and here’s my formula:
- Set high-level, achievable goals and have meetings even when things go right.
- Focus on how everyone will work together to make things better in the future, not what went wrong.
- Get everyone to participate.
- Share with the whole company what the group learned.
Last week I wrote about some changes to our internal downtime process (read that post here); today I'd like to follow up with details about our version of a post-mortem meeting.
Set high-level, achievable goals and meet about success
A maintenance window here is considered a success when we make our changes, recover from any failures without impacting production, and end on time.
As a group, we decided what was okay to include in the window and stripped out some riskier changes: tasks whose duration was hard to estimate, or ones that would eat into the time we allocated for testing. Going into each window, we now have a clear list of tasks, and we can assess the success or failure of each one after the change.
In that first window in January, we completed the following:
- Upgraded our PostgreSQL databases
- Recovered 5% of the disk space on our largest database cluster (see the measurement sketch after this list)
- Fixed a long-standing maintenance issue with parent tables on our largest database
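If you want to put a number on that disk-space recovery instead of eyeballing it, the sketch below is one way: run it right before and right after the window and compare the totals. It's only an illustration, assuming psycopg2 and a read-only account; the connection string is a placeholder, not our actual setup.

```python
# Minimal sketch: measure per-database size before and after a maintenance
# window so a claim like "recovered ~5% of disk space" is backed by numbers.
# Assumes psycopg2 is installed; the DSN below is a placeholder.
import psycopg2

DSN = "host=db-primary.example.internal dbname=postgres user=readonly"

def database_sizes(dsn=DSN):
    """Return {database_name: size_in_bytes} for every non-template database."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT datname, pg_database_size(datname)
                FROM pg_database
                WHERE NOT datistemplate
                """
            )
            return dict(cur.fetchall())

if __name__ == "__main__":
    sizes = database_sizes()
    for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {size / 1024**3:8.1f} GiB")
```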
We decided to have a meeting after the window — regardless of whether the change succeeded or failed.
Talk about what went well (aka Why I decided to call these meetings “wrap-ups”)
I always hated calling these discussions “post-mortems.” I get why tech people want to compare the process to a medical procedure, and I love a good zombie movie, but it sets the wrong tone. I decided to call them “wrap-ups,” to help make it clear that we’re there to reflect on the project, not find blame.
And here’s what we try to do in each wrap-up:
- Spend time talking about how things went well, and why
- Focus on how to improve future projects
- Distill what we learned
Documenting how the team manages maintenance windows makes visible the great work people were already doing. We also open up the meetings so non-IT folks at Emma can contribute and make them better.
Conduct the discussion for 100% participation
After a maintenance window, we communicate the outcome to the rest of our colleagues. Then, I schedule a 30-minute meeting with a simple agenda. We go over what happened during the maintenance window to:
- Discuss what went right
- Discuss what went wrong
- And determine what we could do to make things better next time
In our most recent wrap-up, seven people attended, and I asked each person for at least one comment on the agenda bullet points.
What we learned
In just 30 minutes, we came up with plenty of things that the group felt good about doing well and a set of clear changes to make in the future.
Here are some of the things people liked:
- Creating a custom error message for the maintenance window
- Having a phone bridge and using Campfire throughout the window to communicate
- Using a wiki page to organize tasks and each task’s owner during the maintenance window
- Using the change window to test out new Linux service scripts for the system administration team
This was our first maintenance window where the whole team used both Campfire and a phone bridge at the same time. We chose Campfire because anyone joining late could easily see the conversation that had already taken place. The phone bridge kept everyone in touch while leaving hands free to type commands.
In the past, we’d used email and RT tickets to document what was happening in the maintenance window. Everyone loved having a wiki page to reference and update instead; it was simply a better interface for the job than email or a ticket.
Finally, the systems administration team used the window to test new service start/stop scripts for a series of custom applications. This is the type of thing that can go unexercised when downtimes and maintenance windows are rare. The team was smart to seize the opportunity!
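A test pass over scripts like that doesn't have to be elaborate. Here's a rough sketch of the idea, not the team's actual scripts: the service names and init-script paths are made up, and the goal is simply to confirm that each stop/start/status cycle exits cleanly while there's no production traffic to disturb.

```python
# Rough sketch of using a maintenance window to exercise new service
# start/stop scripts. Service names and script paths are hypothetical;
# adapt to whatever init scripts your custom applications actually ship.
import subprocess
import sys

SERVICES = ["custom-app-api", "custom-app-worker"]  # placeholders

def run(script, action):
    """Run one init-script action and return True if it exits cleanly."""
    result = subprocess.run([script, action], capture_output=True, text=True)
    ok = result.returncode == 0
    print(f"{script} {action}: {'ok' if ok else 'FAILED'}")
    if not ok:
        print(result.stderr.strip())
    return ok

def exercise(service):
    script = f"/etc/init.d/{service}"
    # Stop, start, then check status; stop at the first failing action.
    return all(run(script, action) for action in ("stop", "start", "status"))

if __name__ == "__main__":
    failures = [s for s in SERVICES if not exercise(s)]
    sys.exit(1 if failures else 0)
```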
We also thought a few things didn’t go so well:
- We didn’t give our customers enough of a heads-up.
- Steps for the changes should have numbers, not just times associated with them.
- Our testing took quite a while because the change affected all the databases at the same time, and tests only looked at one database at a time.
There may have been other things people thought we could have done better, but we kept the list short and actionable. In future windows we'll give customers more notice, number all the steps, and test databases concurrently.
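To make the "test concurrently" part concrete, here's a rough sketch of the idea, assuming a trivial psql smoke test; the host names and the query are placeholders, not our real post-change checks.

```python
# Sketch of testing databases concurrently instead of one at a time, which is
# the change we want to make after this window. Hosts and the smoke-test
# query are placeholders; a real run would use our actual post-change checks.
import subprocess
from concurrent.futures import ThreadPoolExecutor

DATABASE_HOSTS = ["db1.example.internal", "db2.example.internal", "db3.example.internal"]
SMOKE_TEST_SQL = "SELECT count(*) FROM pg_stat_activity;"

def smoke_test(host):
    """Run a trivial query against one database and report pass/fail."""
    result = subprocess.run(
        ["psql", "-h", host, "-d", "postgres", "-tAc", SMOKE_TEST_SQL],
        capture_output=True, text=True, timeout=60,
    )
    return host, result.returncode == 0

if __name__ == "__main__":
    # One worker per database: every host is tested at the same time,
    # so total test time is roughly the slowest database, not the sum.
    with ThreadPoolExecutor(max_workers=len(DATABASE_HOSTS)) as pool:
        for host, passed in pool.map(smoke_test, DATABASE_HOSTS):
            print(f"{host}: {'pass' if passed else 'FAIL'}")
```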
Beyond this window, I also asked everyone to imagine how we might do things differently or better during other downtimes.
A few ideas included:
- Trying out video conferencing (like Tokbox) during the maintenance to make communication even better
- Pulling in more helpers for testing, both for training and to lighten the workload on the QA team
- Using Salesforce to communicate upcoming changes internally
My favorite suggestion, though, was:
- Playing “Point of no return” when we know everything worked
Feel free to comment below — I’d love to hear how you manage your meetings, and what you’ve learned.