EN 32: Pesky bugs and observability

For the last few days, I’ve been investigating a pesky bug. There’s something appealing about it: the thrill of the chase. While investigating, you get to put on the detective hat, gathering clues, connecting dots, testing hypotheses and slowly piecing the puzzle together.

The detective hat can also be a scientist hat. You start with imperfect information: a phenomenon occurred, but there’s no working model, no theory, that can explain or reproduce it. To go from knowing next to nothing about how it happened to a fix, I tend to formulate hypotheses first and test them. Every time a hypothesis fails, there’s more information for the next one.

In all fairness, sometimes a bug can’t be reproduced at all, or not with enough consistency, especially when there are many moving parts involved, all happening at the same time. Think race conditions.

If you can reproduce it, it’s a good idea to write a test that manifests the bug, so that once you have a fix, there’s proof that you’ve actually fixed it.
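As a rough illustration, such a test might look like the sketch below, written here with vitest; the function and the bug in it are made up, not the actual one from this issue:

```ts
import { describe, it, expect } from "vitest";

// Hypothetical buggy helper: it was supposed to deduplicate case-insensitively.
function dedupeEmails(emails: string[]): string[] {
  return [...new Set(emails)]; // bug: "A@x.com" and "a@x.com" are kept as two entries
}

describe("dedupeEmails", () => {
  // Written before the fix: this test fails on the buggy code above,
  // and becomes the proof that the fix works once it passes.
  it("treats addresses that differ only in case as duplicates", () => {
    expect(dedupeEmails(["A@x.com", "a@x.com"])).toHaveLength(1);
  });
});
```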

Speaking of race conditions, that’s actually one of the issues with the bug I’ve been looking into. Two or more requests to the backend happen in parallel, and all of them fail. So far so good, nothing unusual there. The spices that add flavour to the situation are:

  • Failed requests are retried.

  • Certain requests attempt to fetch information used by all queries before retrying.

  • Fetching this information can fail; in that case the info gets cleared.

There are a few more issues going on that I haven’t mentioned, and there’s no single root cause: it’s a combination of several factors.
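To make the interplay more concrete, here’s a minimal TypeScript sketch. The names are made up for illustration (a shared auth-style token, an /auth/token endpoint, the retry counts); the real code has more moving parts:

```ts
// A heavily simplified sketch with made-up names.
let sharedToken: string | null = null; // info used by all queries (e.g. an auth token)

async function fetchSharedToken(): Promise<void> {
  try {
    const res = await fetch("/auth/token"); // this fetch can itself fail
    if (!res.ok) throw new Error(`token fetch failed: ${res.status}`);
    sharedToken = await res.text();
  } catch (err) {
    sharedToken = null; // on failure the shared info gets cleared...
    throw err;
  }
}

async function requestWithRetry(path: string, retries = 2): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(path, {
      headers: sharedToken ? { Authorization: `Bearer ${sharedToken}` } : {},
    });
    if (res.ok || attempt >= retries) return res;
    // Before retrying, refresh the shared info. When several failed requests
    // reach this point in parallel, their refreshes race with each other:
    // one failing refresh can clear the token another request just obtained.
    await fetchSharedToken().catch(() => {});
  }
}
```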

Simpler bugs are not much of a problem; they tend to be local. Complex bugs or outages are much more difficult to debug, especially in distributed systems. The typical tools, APMs, logs or metrics, are insufficient and make for a painful experience: they tell you that something happened after the fact, or show you an isolated data point. With those tools you end up staring at a log line, then at the code, for ages, trying to understand what happened, piecing things together out of disconnected information, bottom up.

Nowadays, I couldn’t live without observability in my systems: not just logs and metrics, but distributed traces that give you an end-to-end view. Crucially, being able to send events with any data that might be useful, and then slice and dice them any way you want, is huge. Working this way, you end up asking questions of your system without needing to know deeply how it works internally. It’s a top-down approach, and you really get to put on your detective hat: spot outliers in the data and quickly zoom in and out. If this sounds crazy or foreign to you because you’ve always used logs and low-level debugging, let me tell you, your mind will be blown; I couldn’t go back. If you’re curious, check out the honeycomb.io sandbox and play around.
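As a minimal sketch of what “events with any data that might be useful” looks like, here’s the OpenTelemetry JavaScript API (which tools like Honeycomb accept); the span name, attribute names and the handler itself are made up:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service"); // hypothetical service name

// Wrap a unit of work in a span and attach any context that might be useful later.
async function handleCheckout(userId: string, cartSize: number, retryCount: number) {
  return tracer.startActiveSpan("handle-checkout", async (span) => {
    span.setAttribute("user.id", userId);
    span.setAttribute("cart.size", cartSize);
    span.setAttribute("request.retry_count", retryCount);
    try {
      // ... the actual work goes here ...
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Once every event carries attributes like these, you can group and filter by any of them after the fact, for example slow checkouts where the retry count is greater than zero, which is exactly the slice-and-dice workflow described above.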

With proper observability you also get a massive benefit: you can test and observe your system and the user experience seamlessly, from your local environment all the way to production. The idea is that you add traces from the beginning, while you’re developing the feature end to end, so you can see how everything behaves in real time in your local environment. When the feature ships to production, you get the same experience, but with real users interacting with your site.
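A minimal sketch of how that can look, assuming a Node.js backend instrumented with OpenTelemetry’s NodeSDK; the service name is a placeholder, and the only thing that changes between local and production is where the traces are exported:

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// The same instrumentation runs everywhere; only the destination changes.
// Locally this could point at a collector or Jaeger on localhost,
// in production at your observability vendor's OTLP endpoint.
const sdk = new NodeSDK({
  serviceName: "my-web-app", // hypothetical name
  traceExporter: new OTLPTraceExporter({
    url:
      process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ??
      "http://localhost:4318/v1/traces",
  }),
});

sdk.start();
```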

Interesting links
