01 September 2017
Developing a DevOps Cadence
The reason for this is that the weekly operations meeting prevents your handoff meeting from becoming a long game of telephone between the members of the rotation: “Last week Joe told me that Mary had had some trouble with the servers but they seem to be doing fine now,” and so on. It also gives the more senior members of your team the opportunity to audit and provide suggestions.
Here’s the cadence of a typical (half hour) team operations meeting:
- Previous on-call engineer runs the meeting. New on-call takes action items.
- Previous on-call engineer presents a write-up summarizing issues from the last on-call. Were there new issues that came up? What did he or she do to fix them? This is the part where you find out that your latency’s been increasing for months and people have just been moving the alarm threshold.
- Quick review of existing tickets. Are there any tickets that are over SLA?
- Dashboards. Pull open the dashboards for your services. Does anything look abnormal? (If so, there should be an action item to investigate.) This is the part where your team learns what looks normal on their dashboards. You don’t want the first time they see the dashboard to be during an issue. Look for sudden changes, spikes, and trends.
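The "over SLA" ticket check in the list above is easy to automate before the meeting. Here is a minimal sketch, assuming each ticket has an opening time and a resolution target. The `Ticket` record, its field names, and the ticket keys are hypothetical and not tied to any particular tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    # Hypothetical ticket record; real trackers expose different fields.
    key: str
    opened_at: datetime
    sla: timedelta  # assumed per-ticket resolution target


def tickets_over_sla(tickets, now):
    """Return tickets whose age exceeds their SLA, oldest first."""
    overdue = [t for t in tickets if now - t.opened_at > t.sla]
    return sorted(overdue, key=lambda t: t.opened_at)


# Illustrative data only.
now = datetime(2017, 9, 1, 10, 0)
queue = [
    Ticket("OPS-1", now - timedelta(days=9), timedelta(days=7)),
    Ticket("OPS-2", now - timedelta(days=2), timedelta(days=7)),
    Ticket("OPS-3", now - timedelta(days=30), timedelta(days=14)),
]
overdue_keys = [t.key for t in tickets_over_sla(queue, now)]
print(overdue_keys)  # ['OPS-3', 'OPS-1']
```

Sorting oldest first puts the most neglected tickets at the top of the review, which helps when the meeting is only half an hour.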
As in a changing of the guard, the old on-call must continue to be watchful until he or she has received confirmation that the new on-call is on alert.
Here are some common items for a handoff meeting:
- Page the new on-call. This ensures that the new on-call’s pager is configured correctly and that nothing has changed since the last time they were on-call.
- Review all existing tickets in the ticket queue. Give an update on the status of each individual ticket. Where a ticket has no recent correspondence, update it now.
- Where appropriate, off-load tickets from the on-call queue into ordinary sprint or team work.
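The changing-of-the-guard rule can be written down as a tiny state model: the outgoing engineer stays responsible until the incoming one has acknowledged a test page. The following is a rough sketch under that assumption. The `Handoff` class and its methods are hypothetical, and `send_test_page` only records the step rather than calling a real paging system.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    # Hypothetical model of one rotation handoff; not tied to any paging tool.
    outgoing: str
    incoming: str
    test_page_sent: bool = False
    acknowledged: bool = False

    def send_test_page(self):
        # A real implementation would call your paging system here.
        self.test_page_sent = True

    def acknowledge(self):
        # An acknowledgement only counts if it answers a test page.
        if not self.test_page_sent:
            raise RuntimeError("no test page has been sent yet")
        self.acknowledged = True

    def responsible(self):
        """Return who should be watching alerts right now."""
        return self.incoming if self.acknowledged else self.outgoing


handoff = Handoff(outgoing="mary", incoming="joe")
before = handoff.responsible()
handoff.send_test_page()
during = handoff.responsible()
handoff.acknowledge()
after = handoff.responsible()
print(before, during, after)  # mary mary joe
```

Responsibility moves only on acknowledgement, not when the page is sent, so a misconfigured pager leaves the old on-call in charge instead of leaving nobody in charge.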
Here’s the cadence of a typical weekly org ops meeting:
- Announcements. Things that went well; things that went poorly. Events that are coming up.
- Recent org-wide operational issues. Quick review of the incident and follow-up.
- Review of org-wide open issues/follow-up items that are out of SLA. The person who owns these items should provide a brief update on the status.
- Review of a recent incident. Brief discussion of lessons learned, suggestions from the rest of the org.
- Spin the wheel. Choose a random team’s dashboard. That team should present, explaining any anomalies that appear. They should also explain the alarm level.
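"Spin the wheel" needs nothing more than a uniform random pick, so no team can predict when it will present. A minimal sketch, with hypothetical team names and a hypothetical `spin_the_wheel` helper:

```python
import random


def spin_the_wheel(teams, rng=random):
    """Pick one team uniformly at random to present its dashboard."""
    if not teams:
        raise ValueError("need at least one team")
    return rng.choice(teams)


teams = ["payments", "search", "storage"]  # hypothetical team names
# A seeded generator makes the pick reproducible for a demo;
# omit the second argument in a real meeting.
presenter = spin_the_wheel(teams, random.Random(2017))
print(presenter)
```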