An End-to-End Cloud Native Observability Framework


It has been established that observability is essential to understanding how our systems are performing and when something goes wrong with them. Through my work with enterprises adopting observability services for their critical cloud native workloads, I see observability adopted in silos: requests are often scoped narrowly, focusing only on application traces, only on Kubernetes metrics and logs, or only on CI/CD pipeline telemetry. My approach, however, is to think about end-to-end observability from Day 1 and not treat it as a bolt-on. Here is a demonstration of an end-to-end observability framework using a simple two-microservice application built and deployed on a managed Kubernetes cluster through CI/CD pipelines. It focuses on the telemetry that can be collected from each layer (application, Kubernetes and CI/CD) and how each contributes to faster troubleshooting and better system health.

Application and Observability Architecture

Image 1 shows the architecture, including telemetry collection. The demo application includes two microservices, retail-web and retail-api, running on a managed Kubernetes cluster in a cloud environment. For this application, observability is covered through the following strategy:

- Traces from the application are collected using the OpenTelemetry Collector, which offers a vendor-agnostic way to receive, process and export telemetry data.
- Logs from the Kubernetes cluster are collected using Fluentd DaemonSets (one log collection pod per node).
- Kubernetes infrastructure metrics are collected using the cloud platform's proprietary agent. This could also be done using a Prometheus Node Exporter.
- Because this application runs on a cloud platform, telemetry from cloud native components such as CI/CD pipelines, compute nodes and the Kubernetes control plane is collected through the platform's observability services.

This demo retail application is focused on one feature: the checkout request. The user's checkout request hits the load balancer and is processed by the web frontend, which calls the API backend. The API backend then executes three key operations: checking inventory, charging payment and creating the order. All three operations hit a database. The API backend uses SQLite for demonstration purposes; in production, this would be a managed database service. The observability patterns remain applicable regardless of the underlying database.

Note: A managed Kubernetes environment also includes several other cloud-managed components, such as networking and load balancing. It is important to enable and monitor their telemetry as well, since these services directly affect the reliability of Kubernetes workloads.

Image 1

Capturing Application Traces with OpenTelemetry

A trace captures the end-to-end journey of a single user request. A single trace is made up of multiple spans, depending on how the request traverses the distributed system. In my application, a single user request starts when the user hits the Checkout button on the UI and then moves on to the backend and the SQLite database. A single trace tracking this request will have multiple spans; in this example, individual spans will capture every HTTP request as well as operations such as "verifying inventory," "executing payment" and "create order," as shown in Image 2.

Image 2

The application uses a mix of OpenTelemetry auto-instrumentation (for Flask and HTTP calls) and manual spans (around the business logic stages), and all spans are exported through the OpenTelemetry (OTel) Collector, which enriches them with Kubernetes metadata before sending them to the cloud native application performance service.

Image 3 shows a code snippet from my telemetry module, where I define a custom bootstrap() function. This function configures OpenTelemetry for my service by setting resource attributes such as service.name, service.namespace and deployment.environment. These attributes become part of every span. Inside the same function, I initialize an OpenTelemetry TracerProvider and attach an OTLP (OpenTelemetry Protocol) span exporter, which is the component responsible for sending spans to the next destination in the pipeline, in this case the OTel Collector.

Image 3

If you're new to OpenTelemetry: The OpenTelemetry Protocol (OTLP) is the standard way telemetry signals (traces, logs, metrics) are transmitted between components. The OTLP span exporter sends spans to whichever OTLP endpoint is configured.

Although an OTLP span exporter can send traces directly to an application performance monitoring (APM) backend, I intentionally send them to the OTel Collector instead. There are two reasons for this:

- Vendor neutrality and futureproofing. When the collector sits between the application and the APM backend, I can route the same traces to any backend (Grafana Tempo, Jaeger, cloud native APM services, etc.) without modifying application code.
- Span enrichment and processing. I use the collector to inject Kubernetes metadata (pod name, node name, deployment, etc.) into spans. The collector can also perform batching, sampling, transformations and routing, if needed.

Image 4 shows a code snippet with an auto-instrumentation section inside bootstrap(). Here, I enable OpenTelemetry's instrumentation libraries for:

- Requests (Python's HTTP client), to automatically create client-side spans whenever the application makes outbound HTTP calls (such as retail-web calling retail-api).
- Flask, to automatically create server-side spans whenever the application receives inbound HTTP requests.

Without the RequestsInstrumentor, the downstream API calls would not appear as part of the same trace, and the distributed parent-child relationship would be lost. Flask instrumentation handles inbound traffic, while requests instrumentation handles outbound calls. Because microservices often call each other, both are required to maintain a complete distributed trace.

Image 4

Image 5 shows a code snippet from the application file (app.py), where the bootstrap() function is called immediately after creating the Flask application instance. This is what activates auto-instrumentation for the running service and applies the resource attributes defined earlier.

Image 5
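To make the screenshots easier to follow in text form, here is a minimal sketch of what such a telemetry module and application entry point could look like. The collector endpoint, the service.namespace and deployment.environment values, and the bootstrap(app, service_name) signature are illustrative assumptions, not the article's exact code.

    # telemetry.py (sketch, illustrative only)
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.instrumentation.requests import RequestsInstrumentor

    def bootstrap(app, service_name):
        # Resource attributes become part of every span emitted by this service.
        resource = Resource.create({
            "service.name": service_name,       # e.g., "retail-web"
            "service.namespace": "retail",      # assumed value
            "deployment.environment": "demo",   # assumed value
        })

        # TracerProvider with an OTLP exporter pointed at the OTel Collector,
        # not directly at the APM backend.
        provider = TracerProvider(resource=resource)
        exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
        provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(provider)

        # Auto-instrumentation: Flask covers inbound requests (server spans),
        # requests covers outbound HTTP calls (client spans).
        FlaskInstrumentor().instrument_app(app)
        RequestsInstrumentor().instrument()

    # app.py (sketch)
    from flask import Flask
    from telemetry import bootstrap

    app = Flask(__name__)
    bootstrap(app, "retail-web")  # called immediately after creating the Flask app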
Manual instrumentation means explicitly creating spans in your application code to capture business logic, database operations or any work that auto-instrumentation cannot infer. In the retail-web service, by the time the /checkout/execute handler runs, Flask auto-instrumentation has already created a SERVER span for the HTTP request. In the handler, I fetch that span with trace.get_current_span() and then add a manual parent span called Process_Checkout_Flow with three child spans: Verify_Inventory_Status, Execute_Payment_Charge and Create_Order_Record. Process_Checkout_Flow sits between the HTTP-level SERVER span and these business steps. These spans map directly to my business steps, and I attach attributes like retail.flow and retail.stage so I can later filter traces by flow and stage if I want to. Image 6 below is the code showing the manual parent span and the business span for the verifying inventory step.

Image 6
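As a rough sketch of what that handler-level instrumentation might look like (the attribute values, the retail-api URL and the response payload are assumptions for illustration; app is the Flask instance from the earlier sketch):

    # retail-web handler (sketch of the manual instrumentation, illustrative only)
    import requests
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    @app.route("/checkout/execute", methods=["POST"])
    def checkout_execute():
        # The SERVER span already opened by Flask auto-instrumentation for this request.
        server_span = trace.get_current_span()

        # Manual parent span that groups the business steps.
        with tracer.start_as_current_span("Process_Checkout_Flow") as flow:
            flow.set_attribute("retail.flow", "checkout")

            with tracer.start_as_current_span("Verify_Inventory_Status") as span:
                span.set_attribute("retail.stage", "verify_inventory")
                # Outbound call; RequestsInstrumentor records a client span for it.
                requests.get("http://retail-api/inventory/check")

            with tracer.start_as_current_span("Execute_Payment_Charge") as span:
                span.set_attribute("retail.stage", "execute_payment")
                # ... charge payment ...

            with tracer.start_as_current_span("Create_Order_Record") as span:
                span.set_attribute("retail.stage", "create_order")
                # ... write the order to the database ...

        return {"status": "ok"}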
Image 7 below shows the full span tree for one checkout request. The first span, retail-web: POST /checkout/execute, is the entry-point server span created automatically by the Flask auto-instrumentation. The next span, retail-web: Process_Checkout_Flow, and the nested spans underneath it (inventory check, payment charge and order creation) are manually instrumented. These manual spans come from the code-level instrumentation shown in Image 6.

Image 7

Auto-instrumented and manually instrumented spans serve different purposes. If latency shows up in auto-instrumented spans, it can usually be attributed to overhead such as network connectivity issues, whereas latency in manually instrumented spans mostly points to application logic, such as inefficient loops.

In this application, OpenTelemetry generates three types of spans:

- Server spans, created whenever a service receives an incoming HTTP request (such as retail-web receiving POST /checkout/execute, or retail-api receiving GET /inventory/check).
- Client spans, created whenever a service makes an outgoing HTTP call (such as retail-web calling retail-api). These show the outbound portion of the round trip.
- Internal spans, created inside a service to represent internal units of work. All manual spans (such as Process_Checkout_Flow, Verify_Inventory_Status, Execute_Payment_Charge and the DB operations) fall under this category because they represent code-level operations performed within a service, and I didn't explicitly set the span kind to any other type.

Image 8 below shows the different types of spans in the span tree.

Image 8

I used the OTel Collector to enrich spans with Kubernetes context. Image 9 shows how I configured the OTel Collector to extract Kubernetes metadata.

Image 9
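That kind of enrichment is usually done with the collector's k8sattributes processor. The snippet below is a generic sketch of the shape such a configuration takes, not the article's exact file; the receiver, exporter endpoint and chosen metadata keys are assumptions.

    # OTel Collector configuration (sketch): enrich spans with Kubernetes metadata
    receivers:
      otlp:
        protocols:
          grpc:

    processors:
      batch:
      k8sattributes:
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.pod.name
            - k8s.node.name
            - k8s.deployment.name

    exporters:
      otlp:
        endpoint: apm-backend:4317  # placeholder for the APM service endpoint

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlp]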

Once this configuration is applied, every span sent through the collector includes Kubernetes attributes, as shown in Image 10. This enrichment becomes extremely valuable when troubleshooting performance issues: if a span slows down, I can immediately see which pod and node executed the code.

Image 10

Observability for Managed Kubernetes Environments

In a managed Kubernetes environment on a cloud platform, there is a shared responsibility model. The cloud provider manages the Kubernetes control plane components, such as the API server, etcd, the scheduler, the controller manager and node provisioning, and exposes only the logs and metrics that the platform chooses to make available for the control plane. Everything that runs inside the nodes (pods, containers, application processes) is fully user-managed, and the user needs to own that observability setup. In essence, while all major managed platforms (Oracle, Google, Amazon, Azure) expose control plane metrics for their managed Kubernetes offerings, a user must generally opt in or use the provider's native monitoring solution to consume them. Essential control plane metrics...

Source: This article was originally published on The New Stack
