Escape the Firefighting Trap: Strategies for Teams to Focus on Important Work

Do you find yourself constantly fighting fires and struggling to keep up with urgent customer requests? Have you and your team been so busy with these urgent tasks that you haven't been able to focus on developing anything new in months? It's a frustrating situation that can leave you wondering what went wrong and if there's a way to fix it.

The good news is that there is a solution, and this article will cover it. However, before we dive into the solution, there are a few prerequisites to consider:

  • if you're serious about fixing this issue,
  • if you're a team leader or manager, or at least willing to propose solutions to your team leader or manager, and
  • if your team is currently running or willing to adopt some sort of sprint process, then keep reading.

But be warned: there's no silver bullet that will fix everything overnight, and you'll have to put in the work to execute the steps outlined in this article.


I present to you the ultimate solution: "UODP-based three-stage technique for working with toxic requests to the team." Not satisfied? Good. Let me dive deeper with you so you can be appropriately converted into a believer.

Let us start with one quick definition. In this article, a toxic request means any request that comes from outside the team and satisfies at least one of the following criteria:

  • urgent (from the perspective of the requesting side)
  • directed to one of the team members and not to a team manager
  • broadcast via chat or mail to the team
  • urgent support request that does not involve any feature/bug-fixing work
  • urgent bug/feature that takes priority over everything else the team is doing right now

Now we are ready to outline the solution. At a high level, it has three steps:

  • Identify. Identify which exact type of toxic requests the team is dealing with.
  • Isolate. Isolate part of the team to shield the rest from the toxic requests, so at least some part of the team can do meaningful development.
  • Reduce. Reduce the number of toxic requests (make it your business KPI).

Identify

You cannot do shit unless you know what you are dealing with. And there are several things that you, as a manager, have to do to even start identifying it. You need a rotation on your team with a primary, and the primary should own the task of being the first line of defense for everything customers ask of the team.

The expectation for the primary is to act as a shield. And by "acting as a shield," I mean something like this (you, as a leader, should create a set of expectations tailored to your team and your situation):

  • answer questions in chat
  • proactively triage bugs/requests
  • answer questions in the mail
  • react to any team-level alerts

Again, this is just an example. It is your job, as a leader, to create expectations tailored to the team.

In addition to this generic list of expectations for the primary, there are some reporting expectations on top:

  • Keep a ticket for the oncall, updated daily with a rough list of the items the oncall spent time on.
  • If any request takes more than one hour, file it as a ticket and add it to the current sprint.
  • It is ok to ask other team members for help. However, in such cases, the primary should:
      • explicitly create a ticket,
      • assign it to the team member who is asked to do the work, and
      • clearly mark the ticket as additional work on the sprint (so it can be reviewed at the sprint end).

Since it is hard to say how much work will be required from the primary, it is better to start with a one-week rotation and the expectation that 100% of the primary's time will be allocated to this work. Let me repeat this: at the beginning, no work other than being the primary oncall should be assigned to the primary during the oncall week.

But setting expectations is easy. Execution and holding the team accountable is the hard part. This is where many leaders fall short. Any reasonable team will likely agree to all these things. However, after the first several sprints, you will see that the team is still not using this newly created shield. Customers will keep coming directly to the team members instead of going to the primary, and team members will not redirect such customers to the primary and will instead try to help directly. Everything will remain as is, at least for the first several sprints.

Fear not. This is just the beginning, and it is absolutely expected. Once the new expectations are set, start doing per-person reviews at each sprint review, and for any task that has not been finished because new urgent work was pushed onto the plate, ask the following questions:

  • Why was it impossible to redirect it to the primary?
  • Why is this request not filed as a task?

In 2-3 sprints, you will see that folks start filing everything as tasks, and you will start getting meaningful information from the primary about the amount and nature of the workload. You will finally be getting closer to assessing the baseline of the toxic requests. This is the starting point. Knowing the type of toxic requests you are dealing with, you can move to the second stage.

Isolate

Knowing the amount of toxic tasks, you should understand how many full-time team members you need weekly to work on them. Now you can make an educated judgment call and have as big a rotation as you like (I will tell you what to do if you think you need all the folks on the rotation).
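To make that judgment call a bit more concrete, here is a minimal sketch of how the rotation size could be estimated from the hours logged on the oncall tickets. The function name, the 40-hour engineer week, and the idea of averaging over recent weeks are my assumptions for illustration, not a rule; adjust them to your team.

```python
import math
from typing import Sequence


def rotation_size(toxic_hours_per_week: Sequence[float],
                  hours_per_engineer_week: float = 40.0) -> int:
    """Estimate how many full-time people the rotation needs.

    toxic_hours_per_week: hours spent on toxic requests in each recent week,
    taken from the oncall tickets (hypothetical input format).
    hours_per_engineer_week: assumed capacity of one engineer per week.
    """
    if not toxic_hours_per_week:
        raise ValueError("need at least one week of logged hours")
    average_hours = sum(toxic_hours_per_week) / len(toxic_hours_per_week)
    # Round up: a partially needed person still has to be on the rotation.
    # The rotation always has at least the primary.
    return max(1, math.ceil(average_hours / hours_per_engineer_week))


# Three weeks with 50, 70, and 60 logged hours average to 60 hours per week,
# which is 1.5 engineer-weeks, so the sketch suggests a rotation of two.
print(rotation_size([50, 70, 60]))
```

Treat the number as a starting point for the conversation with your team, not as the final answer.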

Now with that in place, here is a unified algorithm (or checklist if you are more of a pilot person) on how to use this newly created shield for anyone on the team. For any incoming toxic request, ask the customer:

  • Is it urgent?
      • yes => create a ticket and redirect to the primary
      • no => can it wait till the next sprint?
          • yes => create a ticket and add it to the list of things to triage during the next sprint planning
          • no => create a ticket and redirect to the primary
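If you prefer code to checklists, the team-member checklist above can be sketched roughly like this. The enum and function names are made up for illustration; the only thing taken from the checklist is the order of the questions and the outcomes.

```python
from enum import Enum


class Action(Enum):
    # Both outcomes start with creating a ticket, as in the checklist.
    REDIRECT_TO_PRIMARY = "create a ticket and redirect to the primary"
    TRIAGE_NEXT_SPRINT = "create a ticket and triage it at the next sprint planning"


def team_member_triage(is_urgent: bool, can_wait_until_next_sprint: bool) -> Action:
    """What a team member (not the primary) does with an incoming request."""
    if is_urgent:
        return Action.REDIRECT_TO_PRIMARY
    if can_wait_until_next_sprint:
        return Action.TRIAGE_NEXT_SPRINT
    # Not urgent, but cannot wait for the next sprint either.
    return Action.REDIRECT_TO_PRIMARY


print(team_member_triage(is_urgent=False, can_wait_until_next_sprint=True).value)
```

Notice that no branch ends with "just do it yourself": every path produces a ticket, and that is the whole point of the shield.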

And for the primary, for any request/ticket:

  • Is it urgent?
      • yes => is it important in your opinion?
          • yes => do you have cycles to work on it?
              • yes => work on it
              • no => find someone to delegate to (secondary/tertiary/etc.)
          • no => escalate to the manager to re-confirm and to communicate back to the customer (after all, it is not an engineer's job to be upsetting customers)
      • no => add to the backlog to plan for the next sprint
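The primary's checklist can be sketched the same way. Again, the names below are hypothetical, and I am assuming the nesting reads as: urgent, then important, then capacity.

```python
from enum import Enum


class PrimaryAction(Enum):
    WORK_ON_IT = "work on it"
    DELEGATE = "delegate to secondary/tertiary/etc."
    ESCALATE_TO_MANAGER = "escalate to the manager to re-confirm and talk to the customer"
    BACKLOG = "add to the backlog for the next sprint planning"


def primary_triage(is_urgent: bool, is_important: bool, has_cycles: bool) -> PrimaryAction:
    """What the primary does with any incoming request or ticket."""
    if not is_urgent:
        return PrimaryAction.BACKLOG
    if not is_important:
        # Urgent for the customer, but not important in the primary's opinion:
        # the manager re-confirms and carries the message back.
        return PrimaryAction.ESCALATE_TO_MANAGER
    return PrimaryAction.WORK_ON_IT if has_cycles else PrimaryAction.DELEGATE


print(primary_triage(is_urgent=True, is_important=True, has_cycles=False).value)
```

The one branch worth staring at is the escalation: the primary never says "no" to the customer directly.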

At this point, your team should have only a tiny number of cases where the shield is not used. For each such case, there should be a ticket on the sprint board that you and your team discuss at the retrospective to see what is going on, why the shield did not work, and how to fix it.

Next, let's discuss three main symptoms when the shield does not work and what to do:

  • people are not utilizing the primary
  • there are, indeed, a lot of requests, so the primary is overrun by customers
  • the primary cannot help with the specific topic and has to engage a subject matter expert

Primary is Underutilized

This is indeed quite common at the beginning, usually when folks are not used to working in teams that have primaries. It is mentally hard when a customer asks you because you are the subject matter expert, and you still have to redirect the request to the primary, for whom it will take twice as long to do the same work.

We are all human. After all, jumping on the more understandable task is in our nature. If a customer is asking me specifically, this probably means that:

  • I am probably a subject matter expert in this topic (and for me, this means that the task is clear and understandable)
  • The customer's task is very likely urgent

Chances are, my current task is less understandable (and maybe even slightly less urgent). Given all this, it is easy to understand why we do what we do and jump on any opportunity to help the customer (regardless of whether the primary can do it on our behalf). I have seen teams where, for each new high-priority bug, more than HALF of the team would jump on it right away (they even had a chat notification so anyone could be distracted right when the new high-priority bug came in). Needless to say, such teams rarely move any long-term initiatives forward, and as a result, customer experience degrades more and more each quarter.

Hopefully, by now you have the answer to this problem (if you have been reading carefully enough). You should set the expectation that no additional work is done without a ticket on the board. That way, such cases (when new tasks were added to the sprint and were NOT escalated to the primary) will become visible to you, and you can keep providing feedback to the person who keeps doing this. And if the pattern persists, you can always set a formal expectation to redirect such work to the primary in the future. Then it will be up to that person to deliver (or not to deliver) on the job expectations.

A Lot of Requests

A single primary cannot deal with everything. That is fine. As I have mentioned, you can create as big a rotation as needed, and I have had teams with up to three people on the rotation full-time. After all, your goal at this stage is to be honest and identify how many resources you need to cover all toxic requests. We will talk about reducing them later.

There are still several things I want to mention. If you have more than one person on the rotation, the essential rule is: the primary owns the rotation. Think about the primary as a manager for everyone on the rotation for one week; the primary is solely responsible for the outcome of the rotation.

Now back to the question from above: what if you have SO many fires/tasks that you need your entire team to be oncall? If you need ALL of the team, chances are that you are NOT helping with every request anyway and are dropping the ball somewhere. There is good news here: you can very, very likely allocate some resources (maybe one IC-week per sprint) to do some meaningful work, and no one will notice since, again, chances are that you are not solving all the incoming requests anyway.

Primary Can Not Help

I've been on teams where one person did NOT have enough knowledge to cover everything. In this case, it was not a problem of capacity but a problem of expertise.

I have seen two main strategies to deal with it:

  • via fragmented rotation
  • via team education

The first one is simple: if you are lucky enough, the knowledge in your team is split into two groups. Then having two folks (one from each group) on the rotation guarantees that you always have full coverage. One should be primary and the other secondary.

While I have seen this working well in several teams, this requires a lot of luck:

  • you do need to have knowledge clustered into two groups
  • groups should be of more or less equal size
  • the number of people should be large enough for the rotation to be meaningful

The second way is to introduce the following policy for any request that the primary cannot handle directly due to a knowledge gap:

  • The primary still has to do the job by consulting with the knowledge expert. The knowledge expert will still have added work on the sprint board, but that work should take the form of educating the primary on the required topic.
  • The primary, on the other hand, owns the tasks to learn, help the customer, and, most importantly, update the runbook/wiki/docs and potentially run team-level training, so that the next primary will be able to help with similar requests.

It will take up to two quarters (usually), but ultimately this will elevate the entire team to the level where the team can help with almost any request related to the team's products.

At last, we come to the final stage of the system (remember the first two: identify/isolate):

Reduce

Now that we know the nature of the toxic requests, have part of the team isolated to handle just the toxic requests, and have a part of the team that can reliably do meaningful development, we are ready for the third and final stage: reduce the number of toxic requests. Here is the business KPI you should use to hold the team accountable: reduce the team capacity required to run the rotation.

Such a reduction starts with asking questions about the nature of the toxic requests:

  • Are they bugs, and will they be reduced naturally after the team fixes most high-priority bugs?
  • Do we need to improve testing to reduce the bugs we introduce per release?
  • Do we need to improve our documentation?
  • Do we need a road show to explain how our products work?
  • Do we need to invest in UX?
  • Are we properly evaluating importance/urgency? (Maybe it is ok NOT to do some of the requests at all and stay more focused.)

Again, these are just examples. I am sure you will come up with your own set of questions by the time you reach this point.

Small side note. Quite often, people might ask you a question about the relevance of this KPI to the business. Usually, it goes something like this: "if I am going to work X months on improving UX, how can I use this on my promo/scoring case? This has nothing to do with the revenue/usage increase (or whatever other important metrics the company's VP/Director/CEO pays attention to)."

The answer is always simple: if one can deliver on the KPI to reduce toxic requests, one can reduce the size of the rotation from, for example, 2.5 full-time engineers to 1.5. This means that such a person is giving back 1 full-time IC to the team. So one can directly claim the business features that this freed-up capacity delivered over the year, thanks to this work.
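For those who like to see the arithmetic, here is a small sketch of that claim. The function names and the 52 working weeks per year are assumptions of mine (vacations and holidays are ignored), so treat the output as an upper bound for your promo case.

```python
def freed_fte(rotation_before: float, rotation_after: float) -> float:
    """Full-time engineers returned to the team by shrinking the rotation."""
    if rotation_after > rotation_before:
        raise ValueError("the rotation grew, so nothing was freed")
    return rotation_before - rotation_after


def freed_engineer_weeks(fte: float, working_weeks_per_year: int = 52) -> float:
    """Convert freed full-time engineers into engineer-weeks per year (assumed 52 weeks)."""
    return fte * working_weeks_per_year


freed = freed_fte(rotation_before=2.5, rotation_after=1.5)
print(freed)                        # 1.0 full-time IC given back to the team
print(freed_engineer_weeks(freed))  # 52.0 engineer-weeks per year under the assumption above
```

Whatever the team shipped with those engineer-weeks is what goes into the promo/scoring case.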

With that: may the UODP be with you! And have fun prioritizing.

A VERY IMPORTANT LAST COMMENT: high-impact feature requests from the customer are not fires and not toxic requests, and if your team is confusing "work backward from the customer's needs" with "constant fire fighting," this confusion is an entirely different problem that is not in the scope of this article.

There are so many stones left unturned. I kept the article short and covered only the essence, but as a result, I cut a lot of the content, so please consider asking questions in the comments so I can make an educated call about the topics to write about next.