Troubleshooting software requires observability: We need to collect and analyze telemetry to formulate, disprove or validate hypotheses about why our software is behaving differently than we wanted.
Generative AI is increasingly accompanying us on that journey, and it has the potential to take over more of the toil, especially in troubleshooting.
The Iterative Nature of Observability
A system is observable if we can figure out what it is doing based on data (telemetry) it emits. There are many types of telemetry, called signals. The most commonly used are logs, metrics and traces.
Telemetry does not just happen: Our systems must generate it as part of their normal operations. The runtimes that host our applications can be configured to generate a wealth of telemetry out of the box, and so can our container orchestrators, operating systems and so on. We can also add dedicated logic, called instrumentation, to our applications to create additional telemetry. I think of it as application logic we pay forward to debug other application logic.
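As an illustration, here is a minimal sketch of manual instrumentation using the OpenTelemetry Python API; the span name, the attributes and the place_order and submit functions are hypothetical, not taken from any particular system.

```python
# A minimal sketch of manual instrumentation with OpenTelemetry (Python).
# The operation, attribute names and submit() helper are hypothetical.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def place_order(order):
    # Wrap the business operation in a span so it shows up in traces,
    # enriched with attributes that matter to this application.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.item_count", len(order.items))
        span.set_attribute("order.total_amount", order.total)
        return submit(order)  # hypothetical downstream call
```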
Figure: The iterative process of troubleshooting systems.
The telemetry generated by our applications is not always perfect and needs processing: We need to (spam) filter telemetry, because a lot of it is not actually that useful. We need to add context to telemetry, because the application that generates it may not have access to all the information needed to provide the necessary metadata. We also may need to ensure that the right telemetry is forwarded to the right observability backend, in case we use different backends depending on the use case or the signal.
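To make these steps concrete, here is a hedged sketch of what such processing could look like; the record shape, the NOISY_SOURCES set and the backend callables are assumptions made for illustration, not a real pipeline API.

```python
# A hedged sketch of telemetry processing: filter, enrich, route.
# The record format and the backend callables are hypothetical.
from typing import Callable

NOISY_SOURCES = {"healthcheck", "heartbeat"}

def process(
    record: dict,
    environment: str,
    send_metrics: Callable[[dict], None],
    send_logs: Callable[[dict], None],
) -> None:
    # (Spam) filter: drop telemetry we know carries little value.
    if record.get("source") in NOISY_SOURCES:
        return
    # Add context the generating application could not know about.
    record.setdefault("deployment.environment", environment)
    # Route to the right backend depending on the signal type.
    if record.get("signal") == "metric":
        send_metrics(record)
    else:
        send_logs(record)
```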
Once the telemetry gets to the observability backend, we must detect anomalies by looking for signs that something is amiss with our systems. And when anomalies are detected, we must troubleshoot the system.
And in each of these steps, AI is either already a helpful, powerful companion or has great potential to become one.
Artificial Intelligence in Instrumentation
AI coding assistants have great potential to treat observability as the first-class functional requirement of systems that it should be. Unfortunately, to date, that potential remains effectively untapped.
It’s not that AI is not capable of adding instrumentation: When you ask for it, it does a passable job. Yet code assistant tools do not generally add instrumentation by default, and they do not seem to know what telemetry is going to be useful given the kind of applications they work on.
In a sense, the invention is imitating the questionable habits of the inventor: Source code that humans write seldom comes with observability as a functional requirement. This is largely why we have many ways of automatically collecting telemetry from applications at runtime by adding instrumentation. And automatic instrumentation is perfectly fine: Much of the instrumentation related to the technologies we use does not need to be invented anew every time. The world needs exactly one set of metrics about Java garbage collection, and exactly one set of metadata about how to describe HTTP requests and responses.
In other words, automatic, out-of-the-box, generic instrumentation can cover about 80 to 90% of your needs and is the best place to start your observability journey; the remaining 10 to 20% should be ad hoc, application-specific telemetry that reflects the business aspects of your system.
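Application-specific telemetry of this kind could look like the following sketch, which uses the OpenTelemetry metrics API in Python; the metric name, attributes and checkout flow are hypothetical.

```python
# A minimal sketch of business-specific telemetry with the OpenTelemetry
# metrics API (Python). The metric and attribute names are hypothetical.
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")
checkouts_completed = meter.create_counter(
    "checkout.completed",
    unit="1",
    description="Successfully completed checkouts",
)

def on_checkout_completed(payment_method: str) -> None:
    # A business-level signal that no generic auto-instrumentation can emit.
    checkouts_completed.add(1, {"payment.method": payment_method})
```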
Artificial Intelligence in Telemetry Processing
After telemetry is generated, it must be processed and routed for analysis. There are several things AI can help with in terms of processing telemetry:
(Spam) filter telemetry: Not all telemetry is equally valuable. In particular, the telemetry generated by auto-instrumentation is not consistently useful and tends to become indispensable only when explaining anomalies detected elsewhere. I have not yet seen a system that uses AI to select which telemetry to keep beyond short-term storage, but I am very much looking forward to it.
Redact information: There are few systems that have never sent sensitive data over logs or telemetry metadata. AI should be able to detect many of these situations and act accordingly, though I have not seen this in practice yet.
Improve telemetry: Adding missing context, filling metadata gaps (like fixing missing severities in logs) and extracting important information as attributes that can be queried separately (for example, by automatically detecting log patterns).
Aggregate telemetry: Metrics are not a silver bullet: They are a way to frugally (with relatively few data points) represent important aspects of a system, losing a lot of information in the process.

Telemetry collection is the area of observability where AI is most likely to shine. A lot of what observability looks like today is due to limitations we have as humans: Compared to software, we are slow, we mostly do one complex thing at a time, and we are in one place at one time. We collect swaths of telemetry and are limited in how much of it we can analyze. It can take us seconds or minutes to realize that something is amiss. We might not have the time to jump on a bug until next week, so we store a lot of telemetry for a long time.
But software scales way more than humans do. If (and that’s a big “if”) AI can both write and operate our systems autonomously, we will see a shift in which telemetry is collected and for how long. We’ll see dramatically less reliance on metrics and other pre-aggregated information, and much more event-like telemetry (logs, spans, etc.). We’ll see more collection on demand and telemetry stored for much less time.
There is, however, one qualitative difference between humans and AI consuming telemetry: AI needs radically more consistency. As humans, we can remember that we messed up the metadata and called the same thing in three different ways. If we come across team.id and team.identifier in the same troubleshooting session, we know that something is up.
AI takes information at face value, since it lacks intuition and, to a large extent, the ability to amass experience. Moreover, AI generally does not come back with clarification questions, although that may change. And this is why semantic conventions are so crucial for AI agents: They do not have built into them the healthy realism about human fallibility that experienced developers have accumulated one disappointment at a time.
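One way to give AI that consistency is to normalize attribute keys onto a single convention before telemetry reaches the backend. The mapping below is a hypothetical sketch, with team.identifier standing in for the kind of drift described above.

```python
# A hypothetical sketch: normalize drifting attribute keys onto one convention
# so that downstream consumers (human or AI) see consistent metadata.
CANONICAL_KEYS = {
    "team.identifier": "team.id",
    "teamId": "team.id",
    "http.status": "http.response.status_code",  # align with OpenTelemetry semantic conventions
}

def normalize_attributes(attributes: dict) -> dict:
    return {CANONICAL_KEYS.get(key, key): value for key, value in attributes.items()}

# Example: both spellings collapse to the same key.
print(normalize_attributes({"team.identifier": "platform", "region": "eu-west-1"}))
# {'team.id': 'platform', 'region': 'eu-west-1'}
```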
AI in Detecting Anomalies
In terms of observability, we live in captivating times. AI is poised to drastically change the way we generate and consume insights about what is wrong with our systems. It is a paradigm shift that goes well beyond “AI troubleshoots for you.” After a decade of unkept promises, it finally feels real.
For a long time AI has done a pretty good job of detecting anomalies, and I don’t see that changing much. Anomaly detection is a profoundly analytical, statistical and largely deterministic field.
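To illustrate how deterministic this can be, here is a classic rolling z-score check in Python; the threshold, window and latency numbers are arbitrary choices for the sketch, not a recommendation.

```python
# A classic, fully deterministic anomaly check: flag a value that sits more
# than `threshold` standard deviations away from the recent mean.
from statistics import mean, stdev

def is_anomalous(history: list[float], value: float, threshold: float = 3.0) -> bool:
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Example: steady latencies, then a spike.
latencies_ms = [102, 98, 101, 99, 103, 100, 97, 101]
print(is_anomalous(latencies_ms, 180))  # True
print(is_anomalous(latencies_ms, 104))  # False
```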
The potential of generative AI here is largely to reduce false positives by running ad-hoc, additional sanity checks. That blends with the next step, and what everybody is currently excited about: troubleshooting.
AI in Troubleshooting
Troubleshooting is where AI truly is unlocking the next level of observability. Modern models with access to retrieval-augmented generation (RAG) and advanced, deterministic diagnostic tools can debug in a couple of minutes an issue that has left some of the most talented technologists stumped for half an hour.
GenAI can generate queries, dashboards or alerts, relieving the cognitive load on human operators during outages. This can democratize troubleshooting: It greatly lowers the bar, empowering all developers to be more effective at resolving issues, and it frees up the most experienced developers when problems can be solved without pulling them away from other work.
The potential for AI to do a lot of heavy lifting in troubleshooting cannot be overstated. But the most exciting part is that we now have an entirely new paradigm for consuming observability insights.
Observability tool dashboards present a lot of numbers and charts in bright colors jostling for your attention. It is invariably overwhelming. Custom dashboards are only slightly more flexible. This is where the conversational aspect of GenAI is at its best: When wielded well, it can tell the user in plain language exactly what they need to know. I yearn for the day when I open my dashboard and read:
“The product catalog service has been having issues since the last deployment at 12:45, 2 minutes ago. The FindProduct API is consistently failing to retrieve information for a handful of product IDs. It does not look like a database issue. It is affecting on average 1024 unique users every minute and preventing them from completing the Checkout user flow.”
Imagine reading this, followed by a dynamically generated list of relevant visualizations presented as supporting evidence in a logical sequence. It could show hypotheses that were formulated and discarded, with narration explaining its reasoning just one mouse-click away. That future is not far away.
This does not mean that dashboards will go away entirely, but in a world where a narrative about an ongoing issue is available, a static dashboard seems a relic of the past.
It could even make observability a good experience on the small screens of mobile phones. Because GenAI can explain things sequentially, we will consume troubleshooting reports like we read post-mortem blogs.
Once a track record of reliability is built up, we might even eventually trust AI to make changes independently.
Thoughts About Design for Observability in the Age of AI
Interestingly enough, there are unexpected synergies between designing AI for observability and improving the observability experience for humans.
AI troubleshoots like humans, but at an industrial scale. Large language models, because they are trained on human content, emulate the way we do things. They just can do infinitely more of it. This means that...