Move Fast and Improve Things | Part 3: Debugging and Technical Debt
Daniel McGrath is the Director of Engineering at Truebill, the DC-based fintech startup famous for canceling unwanted subscriptions. In this three-part series, we interview him about how he balances technical debt and product development while building a popular consumer tech product with a growing user base.
- In Part I, we talk about what technical debt is and how you can tackle it (or better yet – avoid it). Daniel provides tips for tech teams to pay off technical debt without getting bogged down by it.
- In Part II, we talk about how tech teams can avoid wasting time on premature optimization. Daniel shares advice for tech leaders to strike the right balance between perfection and delivery, without sacrificing the quality of the end-product.
- In Part III, we focus on debugging, and how engineering team leaders can hire and structure their teams to resolve bugs and outages fast.
The conversation below has been edited for length and content.
How do you build a team that can tackle debugging and technical debt for a complex consumer product like Truebill?
A lot of the same strategies for balancing technical debt and avoiding premature optimization apply when you’re building out a team as well. Both systems (your product and your team) have to scale horizontally as well as vertically.
When your company size is 20 people, it’s easy to feel like you need to write a company policy that will scale out to hundreds of employees. But that’s like the over-eager loop optimization I mentioned earlier, where you’re trying to shave off milliseconds even though the code is already reasonably fast. Don’t try to solve scaling problems that don’t exist!
You’re much better off saying “You know what? Here’s the right answer for 20 people working fast on this product. But it won’t work once we’re 70 people, and that’s when we’ll revisit.”
The important thing is that we know that. We understand that what works now won’t work when we’re much bigger, and we’re prepared for change. Anticipate change and be ready for it, rather than doggedly sticking with the thing that has worked for you.
Right now we’re a small team, and we take advantage of that while we can. We’re able to spread around the load of much of the software process because we’re small and we don’t have too many separate development concerns to deal with. That’s much more difficult for larger companies – people naturally develop domain expertise in different areas and certain knowledge silos start to form.
What’s great about being able to democratize processes, involve engineers, and spread knowledge around on a smaller team is that it leads to a lot of ideas. You get these “Oh, I didn’t even consider doing it this way!” moments that you wouldn’t have gotten if you had just two people on something.
When I hire new engineers, I keep these principles in mind. I don’t look exclusively for experts in Truebill’s tech stack. Instead, I look for people who can learn quickly and bring new ideas to the table.
Every developer we’ve hired has been familiar with at least one part of Truebill’s stack. But with so many great pieces of technology out there, it doesn’t matter if someone isn’t an expert in every one. A good developer can learn them quickly.
What is the most effective approach for an engineering team to debug?
I read once that debugging is not the process of guessing the correct answer, it’s the process of continually asking questions until there’s nothing left to clarify and the answer presents itself. The biggest mistake you can make in debugging is assuming you know the answer before you’ve asked the right questions.
Say you get an alert that says your API is down. The worst thing you can do is say “I know why this happened last time – I need to just restart the servers!”
Setting aside the fact that if this has happened a second time, it probably needed more remediation the first time around, reacting too rashly can be costly. If you restart that server too soon and it turns out the issue was actually unrelated to last time’s, you’ve potentially blown away your debugging environment for the issue you were investigating.
That’s one of the reasons we lean on Datadog a lot. They provide all of our centralized logging, tracing, and monitoring for us. Oftentimes when we’re seeing issues in production, we’re able to pretty quickly bring up graphs or application traces that show specifically where we’re hitting a bottleneck and go from there.
If something is going really wrong – your site is down or a critical process is failing or something – it’s usually all hands on deck. This energy around fixing an issue is a good thing, but an uncoordinated response can lead to greater issues.
In my experience, you generally want someone to be the point person for the issue, and you want to make sure that everyone knows who that is. It doesn’t mean that they’re solely responsible for fixing everything, but it helps to have some more centralized state. Imagine if you had 5 engineers all investigating a production issue on one of your servers, and one of them thinks to restart it – they’ve potentially just ruined the investigation of the other four!
This also helps to avoid diffusion of responsibility. If “everyone” is looking into the issue, there’s a chance nobody actually is because they all believe someone else to be on top of it.
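The single-point-person idea can be made concrete with a toy sketch (everything here is hypothetical, not part of Truebill’s actual tooling): the first responder to claim an incident becomes the point person, and everyone else can check who owns it before taking a destructive action like a restart.

```python
import threading

class IncidentCommand:
    """Toy model of incident ownership: one point person per incident."""

    def __init__(self):
        self._lock = threading.Lock()
        self.point_person = None

    def claim(self, name):
        # First responder to claim becomes the point person;
        # later claimants are told to defer instead of acting unilaterally.
        with self._lock:
            if self.point_person is None:
                self.point_person = name
                return True
            return False
```

In practice teams usually do this with an on-call tool or a pinned message in a chat channel rather than code, but the invariant is the same: exactly one owner, visible to everyone.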
When you’re debugging something as complex as a major consumer financial app that’s being used by hundreds of thousands of people, you have to tackle debugging in phases.
- Stop the bleeding. Get the system stable and reduce the amount of cleanup you’re going to need to do later.
- Collect as much information as possible. Again, don’t assume you know the answer. Ask questions until there are no more questions.
- Start documenting and performing the follow-up work that you’ll need to do, including long-term fixes and any cleanup for users affected by the issue.
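As a rough illustration of the three phases above (the record structure and field names are my own invention, not a real Truebill system), an incident only counts as resolved once stabilization, investigation, and follow-up have all been addressed:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Hypothetical incident record mirroring the three debugging phases."""
    title: str
    stabilized: bool = False                        # phase 1: stop the bleeding
    findings: list = field(default_factory=list)    # phase 2: collect information
    follow_ups: list = field(default_factory=list)  # phase 3: fixes and cleanup

    def is_resolved(self):
        # "Fixed" only counts once all three phases have been worked through.
        return bool(self.stabilized and self.findings and self.follow_ups)
```

The point of structuring it this way is that an incident with an empty follow-up list is visibly unfinished, which guards against declaring victory the moment the symptom disappears.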
The debugging process doesn’t stop the moment the problem seems to be fixed. To avoid recurrences or similar problems in the future, you’ll often want to conduct a post-mortem with your team.
- What happened? It’s not enough to just say that a bug or an outage occurred. Instead, give a timeline. You want to create a historical record, like “On day 3 we did X, then two days later, on day 5, Y happened.”
- What was the impact on users? Who did it affect? How, when, and for how long? How are users likely to react?
- What was the root cause of the bug or outage?
Ideally, in putting this together you identify some strategy to prevent the same issue from biting you again. Treat the follow-up with the same level of urgency you would the original issue to prevent it from falling off your radar.