Intro

Hello! This is @rbmrclo from the Site Reliability Engineering (SRE) team. Today, I'd like to share “Operation Trails” (a term we use in our team), an important part of our workflow when performing tasks that involve manual operations.

Background

In the SRE team, we have a 50/50 rule for how we manage our time every day.

To summarise, half of our day usually goes to proactive tasks, which are generally the main projects that contribute to our growth as a diverse tech team (we usually have a roadmap for these). The rest of our time is spent on reactive tasks, which are essential to maintain the stability and reliability of our services, as well as to keep development speed stable across teams.

It can be visualised in blocks like this:

SRE tasks in Quipper

In this article, I will focus on our reactive tasks and explain in detail how we work seamlessly within our team and avoid mottainai (I’ll explain this term later).

Daily Situation

As a global company, each SRE member attends to the needs of multiple teams in different timezones. This also means that each member is working at their own pace.

Some members might be working on a normal routine today with their proactive tasks; some will be performing a maintenance task tonight (midnight!); and some might already be attending to a service outage incident while I’m writing this blog post!

Let’s illustrate that again with my favorite blocks.

My point here is that most of the time, each of us is working in an isolated manner. However, there’s one exception, and this is where Operation Trails come in.

Operation Trails for Reactive Tasks

Imagine that you are working on a task with your headphones on, enjoying your favorite bubble milk tea, listening to a Queen playlist, in the zone and not to be disturbed by humans.

Suddenly, an alert is triggered for a specific monitor. Say the staging cluster has died, so no developers can connect to the staging servers to test their newly implemented features - a major blocker!

Duty calls. Upon receiving the alert message, you quickly check the issue and create an Operation Trail.

  • First, you inform the other SRE team members that you are now checking the issue.
    • You are now considered the assigned person. (ownership is part of our culture!)
    • This is also when the operation trail starts.
    • All SRE members now know that someone is checking the issue. They are also watching the operation trail in parallel.
  • Next, you continuously post updates on what you’re currently doing. (who did what when - like audit trails!)
    • While you post updates, other SRE team members can give suggestions, join the ongoing operation, or just watch the trail. (it all depends on the severity of the situation)
  • Lastly, you inform everyone when the task is finished or when the issue has been resolved. :tada:
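
The three steps above (announce ownership, post updates, announce resolution) can be sketched as a small data structure. This is only an illustration of the flow, not tooling we actually run: the names `OperationTrail`, `TrailEntry`, `start_trail`, `post` and `resolve` are all hypothetical, and in practice the trail is simply typed into chat by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TrailEntry:
    """One update in the trail: who did what, and when."""
    who: str
    what: str
    when: datetime


@dataclass
class OperationTrail:
    """Hypothetical model of a trail: one issue, one owner, many updates."""
    issue: str
    owner: str
    entries: List[TrailEntry] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def post(self, who: str, what: str) -> TrailEntry:
        """Append an update from the owner or from any teammate watching."""
        if self.resolved_at is not None:
            raise RuntimeError("trail is already resolved")
        entry = TrailEntry(who, what, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def resolve(self, summary: str) -> None:
        """The owner announces that the task is finished, closing the trail."""
        entry = self.post(self.owner, f"Resolved: {summary}")
        self.resolved_at = entry.when


def start_trail(issue: str, owner: str) -> OperationTrail:
    """Step 1: announce that you are checking the issue; you become the owner."""
    trail = OperationTrail(issue=issue, owner=owner)
    trail.post(owner, f"Checking: {issue}")
    return trail


if __name__ == "__main__":
    trail = start_trail("staging cluster is down", "alice")
    trail.post("alice", "Restarting the staging nodes")
    trail.post("bob", "Suggestion: check the latest deploy first")
    trail.resolve("staging cluster is reachable again")
    for e in trail.entries:
        print(f"{e.who}: {e.what}")
```

The first entry doubles as the ownership announcement, and refusing updates after `resolve` mirrors the idea that a trail has a clear start and end.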

Here’s the bird’s eye view of what happened.

Responding to alert (reactive task)

:memo: Every operation is in a single thread

:bell: Live reporting

:white_check_mark: Avoid operation conflicts by using calls to action

Summary

Slack Threads

  • In simple terms, operation trails are chat-based and happen in real time. We rely on Slack threads for these.
  • An SRE member can start an operation trail and resolve it on their own, or another SRE member can join the trail to speed up resolving the task at hand.
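
Our trails are written by hand, but if you wanted a bot or script to open one, the threading mechanics are simple. Below is a minimal sketch, assuming Slack's Web API method `chat.postMessage`, where a reply joins a thread by passing the parent message's `ts` as `thread_ts`. The helper `build_message`, the channel name `#sre-ops` and the timestamp value are made up for illustration, and the sketch only builds the payloads rather than sending them.

```python
from typing import Dict, Optional


def build_message(channel: str, text: str,
                  thread_ts: Optional[str] = None) -> Dict[str, str]:
    """Build a chat.postMessage payload (hypothetical helper).

    Without thread_ts the message starts a new trail; with thread_ts
    it is posted as a reply inside the existing thread.
    """
    payload = {"channel": channel, "text": text}
    if thread_ts is not None:
        payload["thread_ts"] = thread_ts
    return payload


if __name__ == "__main__":
    # Opening message: this starts the operation trail.
    opener = build_message("#sre-ops", "Checking the staging cluster alert")

    # Slack returns a `ts` for the posted message; this value is made up.
    parent_ts = "1700000000.000100"

    # Every later update goes into the same thread.
    update = build_message("#sre-ops", "Restarting the staging nodes",
                           thread_ts=parent_ts)
    print(opener)
    print(update)
```

Keeping every update under one parent message is what makes each operation readable as a single thread.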

Avoid Mottainai (もったいない)

The term in Japanese conveys a sense of regret over waste; the exclamation “Mottainai!” can translate as “What a waste!”

  • By establishing a live reporting culture in your team, you can eliminate wasted time.
    • For example, when an SRE member announces that they are already responding to the issue, the other SRE members can just watch the trail while continuing their current tasks. They don’t all need to pause, which makes the most of everyone’s time.
  • By actively posting updates in the operation trail, other members can provide relevant suggestions or possible solutions in order to speed up the operation.

Being a team player

  • Operation Trails improve an individual’s communication skills, since the person operating has to explain what’s happening and what they are doing.
  • As a spectator of the trail, you can judge whether the operation is going smoothly or whether help is needed, at which point it evolves into a “pair operation”.
  • It also improves harmony in the team, since this is one of the times when all of us in the SRE team can meet and collaborate with each other, given that we have individual tasks too.

Acknowledgements

  • There’s also a blog post in Japanese, which is the main inspiration for this post.
  • Many thanks to all SRE members for supporting and adopting this culture. (especially @lamanotrama who introduced this during his time in Quipper)

Do you also have a similar live reporting culture in your team? Share it in the comments below and let’s discuss! We are hiring SRE members. Check it out!