End-to-end distributed tracing involves more than traces alone; it also encompasses other key observability signals, namely logs and metrics. A conversation about tracing is therefore incomplete without addressing both.
While traces provide information about the request flow and performance of individual services in your application, logs and metrics offer additional layers of observability. Logs give detailed, text-based records of events within your application, and metrics provide quantitative data on the performance and health of your system. Together, they offer a more complete picture of your application’s state.
In the first part of this series, we successfully set up end-to-end distributed tracing using Grafana Tempo and OpenTelemetry in a Kubernetes environment. We used a pre-instrumented Django application to send traces to Grafana Tempo through an OpenTelemetry collector. This setup used a Civo Object Store, and the trace data was visualized in Grafana.
Now, in the second part of this series, we will learn how to analyze these traces. In this tutorial, we will examine spans to understand request flows and latency, identify issues and bottlenecks using span metadata, and integrate Grafana Loki and Prometheus as additional data sources in Grafana for a complete analysis of the logs and performance metrics related to those traces.
Analyzing Trace Data
At the end of the previous tutorial, we could view traces in Grafana, as shown in the image below. The image shows that the GET /create trace took 5.98 milliseconds. Additionally, we have details indicating the successful completion of the request with a 200 HTTP status code, signaling that the operation was executed without errors.
In distributed tracing, a trace is a collection of spans, where each span represents a specific operation or segment of work done in the service. Spans within a trace can have parent-child relationships that show the flow and hierarchy of operations.
In this particular trace, we observe a breakdown of individual spans, including the note_create span within the django-notes-app service. This note_create span, which took 2.57 milliseconds, is a child span of the GET /create span.
As a child span, it represents a discrete operation, or a part of the processing that contributes to the overall response of the GET /create request. This hierarchical relationship between spans is crucial for understanding the flow of requests and identifying areas within a service that contribute to the total execution time.
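To make this parent-child relationship concrete, here is a minimal sketch of how nested spans are typically created with the OpenTelemetry Python SDK. The span names mirror the trace above; the actual code in the instrumented application may differ:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# The outer span plays the role of GET /create; in the Django application it
# is created automatically by the WSGI instrumentation.
with tracer.start_as_current_span("GET /create"):
    # Any span started while the parent is active becomes its child,
    # just like the note_create span in the trace above.
    with tracer.start_as_current_span("note_create"):
        pass  # create the note, render the response, and so on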
For a more comprehensive analysis of the trace, you have the option to export the trace data. This can be done by clicking on the export icon highlighted in the image below 👇
The exported data provides detailed information about the trace, such as:
- Services involved, such as django-notes-app
- Span details, including trace ID, span ID, parent span ID, timestamps, and more
- Specific attributes of each span, like HTTP methods, URLs, status codes, and server names
This level of detail is beneficial for in-depth analysis, allowing you to thoroughly examine each aspect of the trace, from the high-level view of the request to the granular details of individual operations.
The exported trace is downloaded as a JSON file and, once opened, looks something like this:
{
"batches": [
// Batch for the trace 'GET /create'
{
"resource": {
"attributes": [
{
"key": "service.name",
"value": {
"stringValue": "django-notes-app"
}
}
],
"droppedAttributesCount": 0
},
"instrumentationLibrarySpans": [
{
"spans": [
{
"traceId": "a4fcabb761c0bcb79f49462d317cb769",
"spanId": "d28cb2de926c9ee4",
"parentSpanId": "0000000000000000", // Root span with no parent
// ... additional span details ...
}
],
"instrumentationLibrary": {
"name": "opentelemetry.instrumentation.wsgi", // Instrumentation library
"version": "0.41b0"
}
}
]
},
// Batch for the trace 'note_create'
{
"resource": {
"attributes": [
{
"key": "service.name",
"value": {
"stringValue": "django-notes-app"
}
}
],
"droppedAttributesCount": 0
},
"instrumentationLibrarySpans": [
{
"spans": [
{
"traceId": "a4fcabb761c0bcb79f49462d317cb769",
"spanId": "29a715d4dba3c442",
"parentSpanId": "d28cb2de926c9ee4", // Parent span ID indicating this span is a child of the 'GET /create' span
// ... additional span details ...
}
],
"instrumentationLibrary": {
"name": "notes_app.views", // Instrumentation library for the view
"version": ""
}
}
]
}
]
}
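Because the export is plain JSON, you can also analyze it programmatically. Below is a minimal sketch that walks the structure shown above and prints each span's duration and parent; it assumes the exported file is saved as trace.json and that each span carries name, startTimeUnixNano, and endTimeUnixNano fields, as in the OTLP-style export format:

import json

with open("trace.json") as f:  # path to the exported trace file
    exported = json.load(f)

spans = []
for batch in exported["batches"]:
    for scope in batch["instrumentationLibrarySpans"]:
        spans.extend(scope["spans"])

for span in spans:
    duration_ms = (int(span["endTimeUnixNano"]) - int(span["startTimeUnixNano"])) / 1e6
    parent = span.get("parentSpanId", "")
    is_root = parent in ("", "0000000000000000")
    print(
        f"{span.get('name', '<unnamed>')}: {duration_ms:.2f} ms "
        f"({'root span' if is_root else 'child of ' + parent})"
    )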
Reconfiguring the Django Application
So far, we can view requests as they flow through our application (traces), including timing data and the interactions between the different components and services of our Django application. We can now integrate logs and metrics into this setup to enhance our observability capabilities. This addition will enable us to:
- Send logs to our OpenTelemetry collector so we can analyze log data alongside trace data.
- Send metrics to our OpenTelemetry collector so we can monitor key performance indicators for a more comprehensive understanding of our application’s behavior.
Step 1: Cloning the Django Application
First, we need to configure our Django project to send logs and metrics to our OpenTelemetry collector in our Civo Kubernetes cluster.
Clone the following GitHub repository; the Django project has been configured to generate detailed logs using the OpenTelemetry Logging Instrumentation and a custom format that integrates trace and span IDs.
For metrics, it employs the OpenTelemetry Metrics API to track the number of requests it receives using a counter metric. This counter, named request_count, increments with each incoming request to the Django notes-app application, providing a straightforward yet effective way to monitor traffic load. The count data is then exported through an OpenTelemetry exporter to establish a robust framework for logging and performance monitoring of the Django application.
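The exact implementation lives in the repository, but a minimal sketch of this kind of instrumentation might look like the following. The counter name request_count matches the description above; the view function and attribute names are illustrative, and a MeterProvider with an OTLP exporter is assumed to be configured elsewhere (for example, via OpenTelemetry auto-instrumentation):

import logging
from opentelemetry import metrics
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Inject otelTraceID / otelSpanID attributes into every log record so each
# log line can be correlated with its trace.
LoggingInstrumentor().instrument()
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [trace_id=%(otelTraceID)s "
           "span_id=%(otelSpanID)s] %(message)s",
)

# A counter that is incremented once per incoming request.
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "request_count", description="Number of requests received"
)

def create_note(request):  # illustrative Django view
    request_counter.add(1, {"endpoint": request.path})
    logging.getLogger(__name__).info("Handling %s", request.path)
    ...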
Step 2: Dockerizing the Django Application and Pushing It to DockerHub
Once cloned, create a DockerHub repository, then build the Docker image and push it to the new repository using the following commands:
docker build -t <your-dockerhub-username>/<repository-name>:latest .
docker push <your-dockerhub-username>/<repository-name>:latest
Step 3: Updating the Django Application Deployment
Now that we have dockerized the Django project and pushed it to DockerHub, let's update our deployment.
To begin, update the previous deployment's image to point to the new Docker image using the following command:
kubectl set image deployment/django-deployment django-app=<your-dockerhub-username>/<repository-name>:latest
This will update the existing Kubernetes deployment with our new image. You should see the following output once the deployment has been updated:
deployment.apps/django-deployment image updated
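Optionally, you can wait for the rollout to complete before checking the pods:

kubectl rollout status deployment/django-deployment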
Confirm that the Django application is running using the following command:
kubectl get pods
Once it is running, you should have the following output:
NAME READY STATUS RESTARTS AGE
django-deployment-6c4c7d4bcf-lwx8v 1/1 Running 0 65s
Installing Grafana Loki
Having successfully configured our Django project to generate logs and metrics in addition to traces, our next step is to set up the infrastructure required for visualizing and analyzing this data.
We've already established a pipeline for forwarding traces from our OpenTelemetry collector to Grafana Tempo, which are then visualized in Grafana. Now, we'll extend this capability to include logs and metrics.
To achieve this, we'll first install Loki for log aggregation and Prometheus for metrics collection. These tools will serve as the foundational elements for our observability stack, allowing us to gain deeper insights into our application's performance and behavior.
Step 1: Configuring Loki Stack
When you install the Loki Stack via Helm, it comes as a comprehensive stack that includes not only Loki (and Promtail for shipping logs) but also Prometheus and Grafana. This stack provides an integrated solution for log aggregation, metrics collection, and data visualization.
However, for more granular control over these components, we will install them separately. Since we already have Grafana installed, we won't need to install it again.
Begin by creating a file named loki-values.yaml. This file will host our custom configurations for the Loki Stack installation. Use a text editor to create this file and insert the following settings:
loki:
enabled: true
prometheus:
enabled: false
grafana:
enabled: false
These settings ensure that only Loki is enabled during the installation, while Prometheus and Grafana are not installed as part of this stack. This approach lets us maintain the existing Grafana setup and manage Prometheus separately.
Step 2: Installing Loki Stack
With the Loki Stack configured, we can now install it using Helm with the custom settings created in the previous step.
Execute the following command to install the Loki Stack chart:
helm install loki grafana/loki-stack -f loki-values.yaml
You should have the following output:
NAME: loki
LAST DEPLOYED: Thu Nov 30 05:49:24 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a data source in Grafana.
See http://docs.grafana.org/features/datasources/loki/ for more detail.
After running the Helm command, check your Kubernetes cluster to confirm that Loki is up and running:
kubectl get pods
kubectl get svc
You should have the following output:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
...
loki-0 0/1 Running 0 20s
loki-promtail-tqghk 1/1 Running 0 20s
loki-promtail-5nsfv 1/1 Running 0 20s
# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
loki-headless ClusterIP None <none> 3100/TCP 25s
loki-memberlist ClusterIP None <none> 7946/TCP 25s
loki ClusterIP 10.43.241.73 <none> 3100/TCP 25s
Installing Prometheus
With Loki configured and installed in our cluster, we'll go ahead and configure and install Prometheus next. To achieve this, we will use the kube-prometheus-stack Helm chart.
Step 1: Configuring Prometheus
Before installing Prometheus, we need to create a job configuration that will allow Prometheus to scrape metrics from specific targets.
Create a file named prometheus-values.yaml and paste in the following configuration:
global:
scrape_interval: '5s'
scrape_timeout: '10s'
prometheus:
prometheusSpec:
additionalScrapeConfigs: |
- job_name: otel-collector
static_configs:
- targets:
- opentelemetry-collector:8889
grafana:
enabled: false
This configuration does the following:
- Sets the global scrape interval to every 5 seconds and the scrape timeout to 10 seconds. This defines how frequently Prometheus will collect metrics and the maximum time allowed for a scrape request.
- Adds a new scrape job named otel-collector. This job is configured to scrape metrics from the opentelemetry-collector service at port 8889. We will configure our OpenTelemetry Collector to expose this port later.
- Sets Grafana to false, indicating that we are not installing Grafana as part of this Prometheus setup, since it ships with the kube-prometheus-stack chart.
Step 2: Installing Prometheus
After configuring the scrape settings in prometheus-values.yaml, the next step is to install Prometheus in our Kubernetes cluster.
Begin by adding the Prometheus chart repository to your Helm setup. This ensures you have access to the latest Prometheus charts:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Now, install Prometheus with Helm using the custom configurations you've defined above:
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
You should see output similar to the following:
NAME: prometheus
LAST DEPLOYED: Thu Nov 30 06:42:52 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
...
After the installation process completes, you can verify if Prometheus is running correctly using the following commands:
kubectl get pods
kubectl get svc
You should see something similar to this:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
...
prometheus-prometheus-node-exporter-rblhc 0/1 Pending 0 2m10s
prometheus-prometheus-node-exporter-n7z8n 0/1 Pending 0 2m10s
prometheus-kube-prometheus-operator-7d89b9dd4d-h24fx 1/1 Running 0 2m10s
prometheus-kube-state-metrics-69bbfd8c89-xlnlk 1/1 Running 0 2m10s
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 2m7s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 2m6s
#kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
prometheus-prometheus-node-exporter ClusterIP 10.43.81.61 <none> 9100/TCP 6m43s
prometheus-kube-prometheus-operator ClusterIP 10.43.252.136 <none> 443/TCP 6m43s
prometheus-kube-prometheus-prometheus ClusterIP 10.43.64.194 <none> 9090/TCP,8080/TCP 6m43s
prometheus-kube-state-metrics ClusterIP 10.43.60.6 <none> 8080/TCP 6m43s
prometheus-kube-prometheus-alertmanager ClusterIP 10.43.144.21 <none> 9093/TCP,8080/TCP 6m43s
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 6m39s
prometheus-operated                       ClusterIP   None            <none>        9090/TCP                     6m38s
Step 3: Creating a Service Monitor
The objective is to enable Prometheus to scrape metrics from our OpenTelemetry collector instance, allowing us to view these metrics in Grafana. To achieve this, we need to create a Service Monitor, a Kubernetes resource used by Prometheus to specify how to discover and scrape metrics from a set of services.
Create a file called service-monitor.yaml and paste in the following configuration settings:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: otel-collector
labels:
release: prometheus
spec:
selector:
matchLabels:
app: opentelemetry-collector # Ensure this matches the labels of your OpenTelemetry Collector service
endpoints:
- port: metrics # The name of the port exposed by your OpenTelemetry Collector service
interval: 5s
This configuration sets up a service monitor called otel-collector. It carries the label release: prometheus, where prometheus is the name of our Prometheus Helm release; this label is what the kube-prometheus-stack uses to discover service monitors.
The service monitor is set to look for the OpenTelemetry Collector service, which we have named opentelemetry-collector, and to check its metrics port every 5 seconds. This port is where our application's metrics will be available, and we will set it up later.
Now run the following commands to create the service monitor and confirm that it exists:
kubectl apply -f service-monitor.yaml
kubectl get servicemonitor
You should see the following outputs:
#kubectl apply -f service-monitor.yaml
servicemonitor.monitoring.coreos.com/otel-collector created
#kubectl get servicemonitor
NAME AGE
prometheus-prometheus-node-exporter 12m
prometheus-kube-prometheus-operator 12m
...
otel-collector 61s
Next, access the Prometheus UI on your local machine. This will allow us to confirm that it has picked up the otel-collector service monitor we just created. On your machine, run:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090
Head over to your browser and visit localhost:9090:
Click on the Status dropdown, and select Service discovery.
You should see the otel-collector listed as shown below:
Updating the OpenTelemetry Collector
Now that Loki and Prometheus are configured and installed, we need to update our OpenTelemetry Collector configuration to forward logs to Loki and metrics to Prometheus.
Navigate to your OpenTelemetry Collector configuration file and add the necessary exporters for Loki and Prometheus:
#collector.yaml
...
exporters:
debug: {}
otlp:
endpoint: grafana-tempo:4317
tls:
insecure: true
loki:
# Loki exporter configuration
endpoint: http://loki:3100/loki/api/v1/push
prometheus:
# Prometheus exporter configuration
endpoint: 0.0.0.0:8889
service:
pipelines:
...
metrics:
receivers: [otlp]
processors: [batch]
exporters: [debug, prometheus]
...
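Note that the loki exporter only takes effect once it is referenced from a logs pipeline. If the elided pipelines section of your configuration does not already define one, a minimal sketch, reusing the otlp receiver and batch processor already present in the config, would look like this:

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, loki]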
Now upgrade the OpenTelemetry collector chart using the following command:
helm upgrade opentelemetry-collector open-telemetry/opentelemetry-collector -f collector.yaml
Next, execute the following command to edit the OpenTelemetry Collector service. This step is necessary to add port 8889 to the list of ports exposed by the service so that Prometheus can access and scrape metrics from it.
kubectl edit service opentelemetry-collector
This will open up the service manifest in a Vim editor. Scroll down to the last entry in the ports section of the service specification, press i to enter insert mode, and type in the following:
- name: metrics
port: 8889
protocol: TCP
targetPort: 8889
Once added, exit the insert mode by pressing the Esc key. Then, type :wq and press Enter to save the changes and exit the OpenTelemetry collector service manifest file.
You should have the following output:
service/opentelemetry-collector edited
Confirm that port 8889 is actually exposed by running kubectl get service. You should see port 8889 listed among the exposed ports like so:
opentelemetry-collector ClusterIP 10.43.73.178 6831/UDP,14250/TCP,14268/TCP,4317/TCP,4318/TCP,9411/TCP,8889/TCP 108m
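Keep in mind that changes made with kubectl edit can be overwritten the next time you run helm upgrade. A more durable alternative is to declare the extra port in collector.yaml through the chart's ports values; the sketch below shows the idea, though the exact keys may vary between chart versions:

ports:
  prom-exporter:
    enabled: true
    containerPort: 8889
    servicePort: 8889
    protocol: TCP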
Head back to your Prometheus server UI and navigate to the Targets option from the Status dropdown. You should see that the otel-collector service monitor is active and up as a target:
This confirms that Prometheus has been configured correctly to scrape metrics from our OpenTelemetry collector.
Viewing Logs and Metrics with Grafana
Up until now, we have successfully set up an infrastructure that sends logs and metrics to Loki and Prometheus. We are now ready to view this data in Grafana.
Step 1: Adding Loki as a Datasource
To begin viewing logs in Grafana, you first need to add Loki as a datasource.
Navigate to the settings icon on the left panel and select Home.
Click on Add your first data source, then search for and select Loki from the list of available data sources.
In the Loki data source settings, enter the URL of your Loki service - http://loki:3100. This is usually something like http://<loki-service-name>:3100. Save and test the data source to ensure Grafana can connect to Loki.
Once connected, head over to Explore and select Loki as shown below 👇
Add a label filter with the label container and the value django-app, then click on the Run query button:
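Behind the scenes, this label filter translates into a LogQL stream selector equivalent to the following, assuming the container label value django-app set in the deployment manifest:

{container="django-app"}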
You should see the following output:
This confirms that Loki is receiving logs. Because of how the Django application's logging instrumentation is configured, each log line related to the views in the Django application shows the date and time it was generated along with its trace ID and span ID.
By clicking on a log line, you can see its labels: the app label of the Django application, which in this case is django-app; the container, django-app (just as specified in the deployment manifest for the Django application); the job, representing tasks; and the namespace, representing the Kubernetes namespace in which the application is running.
Additionally, you will see the name of the node, indicating the specific server in the Kubernetes cluster where the pod is hosted, and the name of the pod, the smallest deployable unit in Kubernetes, which contains the Django application.
From here, you can download the logs in either a .txt or .json format to get a complete view of what they comprise:
Step 2: Adding Prometheus as a Datasource
Just as we did for Loki, we need to add Prometheus as a data source so we can view metrics generated by the Django application:
Follow the same steps used for Loki to add Prometheus as a data source, using the endpoint prometheus-kube-prometheus-prometheus:9090 in the Prometheus data source settings.
Once you have successfully added Prometheus (the Prometheus server) as a data source, head over to Explore and select Prometheus.
Before we begin to view metrics, there are some things you should take note of:
- The Django application was instrumented using a counter metric. A counter is a simple metric type in Prometheus that only increases and resets to zero on restart. In our case, we've used it to count the number of requests the Django application receives. This gives us a straightforward yet powerful insight into the application's traffic.
- Each request to the application increments the counter by one, regardless of the request type (GET, POST, etc.) or the endpoint accessed. This approach provides a high-level overview of the application's usage and can help identify trends in traffic, peak usage times, and potential bottlenecks.
- When viewing this metric in Prometheus or Grafana, you'll see a continuously increasing graph over time, representing the cumulative count of requests.
Select the label filter exported_job with the value django-notes-app, click on the metric dropdown, and select request_count_total as shown below:
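In the query builder, this selection corresponds to a PromQL expression along the lines of the one below. You could also wrap it in rate() to chart requests per second instead of the cumulative count:

# cumulative request count for the Django notes app
request_count_total{exported_job="django-notes-app"}

# per-second request rate over the last 5 minutes
rate(request_count_total{exported_job="django-notes-app"}[5m])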
Once you click on Run query you should see the following:
When you run the query, you'll see a graph showing how many requests have been made over time. You can also select individual requests for a detailed view. Each request on the list is color-coded, making it easy to match with its corresponding graph.
Select the first request from the graph section; the graph will focus on that specific request and stop at its total count, as shown below:
From the image above, the first request was selected, and the graph stopped at the total count of that request, which is 4.
We have successfully generated metrics in our Django application, routed them to our OpenTelemetry collector, and configured Prometheus to scrape them. Additionally, we can now view these metrics in Grafana.
Troubleshooting
In any complex setup like this, you might encounter issues. Here are some common troubleshooting steps:
- Incorrect configurations are a common source of problems. Double-check your collector.yaml, service manifests, and any Helm values files you've used.
- Ensure Prometheus is correctly discovering and scraping targets. Access the Prometheus UI and check under Status → Service discovery or Status → Targets.
- Verify that the data sources in Grafana are correctly set up and can connect to Loki and Prometheus.
- If Prometheus isn't scraping metrics as expected, verify the configuration of your service monitor. Ensure the labels and selectors correctly match your OpenTelemetry Collector service. You can also use kubectl describe servicemonitor otel-collector to view detailed information about the service monitor.
Summary
Through this guide, we've taken a deep dive into setting up a comprehensive observability stack for a Django application pre-instrumented with OpenTelemetry and running in Kubernetes. By integrating Grafana Tempo for distributed tracing, Loki for log aggregation, and Prometheus for metrics collection, we have created a robust environment that tracks and visualizes key aspects of our application's performance and health.
By completing this tutorial, you're well on your way to mastering Kubernetes-based application monitoring and troubleshooting. Keep experimenting and learning to harness the full potential of these powerful tools.
Further Resources
If you want to learn more about this topic, here are some of my favorite resources:
- The OpenTelemetry Docs
- Prometheus Configuration Docs
- Loki-Stack Helm Chart Repository
- Henrik Rexed Navigate Europe 2023 talk on The Sound of Code: Instrument with OpenTelemetry