(originally posted on tech emma)
Nobody likes admitting mistakes. Except this guy.
Getting people to go to a post-mortem meeting is easy. Getting people to participate without a sense of impending doom is hard. Most people don’t want to be there. They show up ready to fight or make excuses, with a pit in their stomach as they wait to talk about what went wrong.
So, how do you fix that pit-in-the-stomach feeling? We’ve worked on this a bit at Emma, and here’s my formula:
- Set high-level, achievable goals and have meetings even when things go right.
- Focus on how everyone will work together to make things better in the future, not what went wrong.
- Get everyone to participate.
- Share with the whole company what the group learned.
Now might be a good time to tell you that I wrote about some changes to our internal downtime process last week (read that post here); today I’d like to follow up with details about our version of a post-mortem meeting.
Set high-level, achievable goals and meet about success
A maintenance window here is considered a success when we make our changes, recover from any failures without impacting production and end on time.
As a group, we decided what’s okay to include in the window, and stripped out some riskier changes. Those included tasks that were hard to estimate time for, or ones that would push against the amount of time we allocated for testing. At this point, going into each window, we have a clear list of tasks, and we can assess the success or failure of each task after the change.
In that first window in January, we completed the following:
- Upgraded our PostgreSQL databases
- Recovered 5% of the disk space on our largest database cluster
- Fixed a long-standing maintenance issue with parent tables on our largest database
We decided to have a meeting after the window — regardless of whether the change succeeded or failed.
Talk about what went well (aka Why I decided to call these meetings “wrap-ups”)
I always hated calling these discussions “post-mortems.” I get why tech people want to compare the process to a medical procedure, and I love a good zombie movie, but it sets the wrong tone. I decided to call them “wrap-ups,” to help make it clear that we’re there to reflect on the project, not find blame.
And here’s what we try to do in each wrap-up:
- Spend time talking about how things went well, and why
- Focus on how to improve future projects
- Distill what we learned
Documenting how the team manages maintenance windows makes the great work people were already doing visible. We also open up the meetings so non-IT folks at Emma can contribute and make them better.
Conduct the discussion for 100% participation
After a maintenance window, we communicate the outcome to the rest of our colleagues. Then, I schedule a 30-minute meeting with a simple agenda. We go over what happened during the maintenance window to:
- Discuss what went right
- Discuss what went wrong
- And determine what we could do to make things better next time
In our most recent wrap-up, seven people attended, and I requested at least one comment from each person on the agenda bullet points.
What we learned
In just 30 minutes, we came up with plenty of things the group felt we had done well, and a set of clear changes to make in the future.
Here are some of the things people liked:
- Creating a custom error message for the maintenance window
- Having a phone bridge and using Campfire throughout the window to communicate
- Using a wiki page to organize tasks and each task’s owner during the maintenance window
- Using the change window to test out new Linux service scripts for the system administration team
This was our first maintenance window where we used both Campfire and a phone bridge at the same time for the whole team. We chose Campfire because anyone new who joined could easily see what conversation had already taken place. We used the phone bridge to make it simple to type commands and stay in touch at the same time.
In the past, we’d used email and RT tickets to document what was happening in the maintenance window. Everyone loved having a wiki page to reference and update instead; it simply offered a better interface and a better experience than email or a ticket.
Finally, the systems administration team used the window to test out new service start/stop scripts for a series of custom applications. This is the type of thing that can go unexercised when you rarely have downtimes or maintenance windows. The team was smart to seize the opportunity!
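For readers curious what that kind of script does, here’s a minimal sketch of a pidfile-based start/stop wrapper. This is not the team’s actual script — the pidfile path and the launched command are hypothetical stand-ins.

```python
import os
import signal
import subprocess
import sys

# Hypothetical values; a real wrapper would point at the actual application.
PIDFILE = "/tmp/example_app.pid"
COMMAND = [sys.executable, "-c", "import time; time.sleep(300)"]

def start():
    """Launch the application and record its pid, unless it's already running."""
    if os.path.exists(PIDFILE):
        print("already running")
        return
    proc = subprocess.Popen(COMMAND)
    with open(PIDFILE, "w") as f:
        f.write(str(proc.pid))
    print("started pid %d" % proc.pid)

def stop():
    """Read the recorded pid, terminate the process, and clean up the pidfile."""
    if not os.path.exists(PIDFILE):
        print("not running")
        return
    with open(PIDFILE) as f:
        pid = int(f.read())
    os.kill(pid, signal.SIGTERM)
    os.remove(PIDFILE)
    print("stopped pid %d" % pid)
```

The value of exercising scripts like this during a window is exactly the edge cases: stale pidfiles, processes that ignore SIGTERM, and so on, which a sketch this small deliberately glosses over.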
We also thought a few things didn’t go so well:
- We didn’t give our customers enough of a heads-up.
- Steps for the changes should have numbers, not just times associated with them.
- Our testing took quite a while because the change affected all the databases at the same time, and tests only looked at one database at a time.
There may have been other things that people thought we could have done better, but we kept the list short and actionable. We’ll change the process slightly in the future to inform customers better, add numbers to all the steps and test databases concurrently.
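The concurrent-testing change is easy to picture in code. Here’s a hedged sketch of running the same post-change check against several databases in parallel; the database names and the check itself are placeholders for the real tests.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical database names; the real window touched every database in the cluster.
DATABASES = ["db1", "db2", "db3"]

def check_database(name):
    # Stand-in for the real post-change tests, which would connect to the
    # database and verify the upgrade; here it just reports success.
    return name, True

def run_all_checks(databases):
    # One worker per database, so the wall-clock time is roughly that of the
    # slowest single test rather than the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(databases)) as pool:
        return dict(pool.map(check_database, databases))

results = run_all_checks(DATABASES)
print(results)
```

Since the tests are mostly waiting on the databases rather than burning CPU, threads are a reasonable fit here.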
Beyond this current window, I also asked everyone to imagine how we might do things differently or better during other downtimes.
A few ideas included:
- Trying out video conferencing during the maintenance, like Tokbox, to help make communication even better
- Pulling in more helpers for testing — for training, and making the workload lighter for the QA team
- Using Salesforce to communicate upcoming changes internally
My favorite suggestion, though, was:
Feel free to comment below — I’d love to hear how you manage your meetings, and what you’ve learned.