The Sound of Code: Instrument with OpenTelemetry
Speaker: Henrik Rexed
Summary
Join Henrik Rexed in this insightful talk as he explores "The Sound of Code" and demonstrates how to instrument your code with OpenTelemetry for improved observability. OpenTelemetry enables the generation of traces, metrics, and logs, providing valuable insights into application performance and troubleshooting in production environments. The talk covers the components of OpenTelemetry, how to customize telemetry data, and the importance of context in observability solutions.
Transcription
All right, so we're going to talk about the sound of code. You'll see that code has a sound, especially if you instrument it with OpenTelemetry. So, how many of you have heard about OpenTelemetry here? All right, are you using it? Yes, here? All right. So, let's go on this topic. Oh, interesting. Okay, so first of all, my name is Henrik Rexed. I'm a Cloud Native Advocate at Dynatrace; it's been two years that I'm working at Dynatrace. Prior to that, I was working as a Performance Engineer, so I'm pretty much dedicated to performance, and that performance is still in my heart. That's why I've been producing content for performance engineers on a channel called PerfBytes. And a year and a half ago, I started a professional YouTube channel for those who want to know more about observability, called Is It Observable? So, check it out, it's out there. And, to improve the content, you need feedback. So check it, share your feedback, so I can improve my content.
All right. So what are we going to talk about in the next 25 minutes? A couple of things. First, for those who have never heard about OpenTelemetry, I'm going to do a couple of reminders: what OpenTelemetry is, the various components involved in OpenTelemetry, and how you can produce traces and metrics. Then, you will see that because you have the freedom to build your own custom data, you also have the need to validate that your signals, the metrics that you're building, are good enough to troubleshoot in production. And last, we'll see how we can continuously improve those signals at the end of this presentation.
So a couple of years back, I'm pretty sure you all know, we were all relying on a term called APM, Application Performance Monitoring. We were working in fixed environments with virtual machines or physical servers, very easy to manage at the end of the day. And with most of the solutions on the market, Dynatrace included, we were injecting an agent. With that agent, we were able to get data from the host, the processes, the application. Some applications were even producing traces out of that. And with that, we were able to get dashboards and some alerting. So that was beautiful. But that was in a fixed environment. Then, one of the features provided by those solutions was profiling. I don't know if you've been using profiling, but profiling is super powerful, especially when you write code. There were solutions that were pretty smart, where you didn't have to configure anything and you got profile data out of the box. So you were very happy. There were other solutions where you had to configure everything: which classes, which functions, which... So basically, you were spending hours figuring out the right settings to get the right information.
But even with this, those profilers were basically blocking our applications, so they were very difficult to use. We never used them in production, because they made things so slow; we were mainly using them in dev, especially if we had bugs. But not in production. No, no, no. So that's one thing: we were using APM, but we didn't have the option to properly use the observability solutions we had at that time.
And something happened one day: most of us moved to the cloud. So we don't have a fixed environment anymore; we have ephemeral environments where we have a pod or a node for a few hours and then it's gone. So the challenge of deploying agents is more difficult; it's another world. The other thing that happened is the methodology. We're using Agile and DevOps, and as part of those methodologies, you need observability. So it's even more critical to get observability to drive the proper automation around your applications. Using APM as it was is not possible anymore, so we had to think differently.
So what is the solution? Well, observability, of course. So observability, what is it? For those who have never heard of it, I guess you know the term because it's a buzzword. But the idea is that we need the right information to understand precisely what's going on in your environment. And for that, we need to rely on different signals. You have logs; we know those are events. Kubernetes generates tons of events, but also when you build your platform, when you release your solutions, you have events. You can send those events back to your observability solution, so you have more context for a given situation. Metrics, of course; traces; and profiling. I mention it because profiling is a really rich source of information to understand what's really happening within your code. Having all this is great, but without any context, it doesn't make sense. Attached to those signals, we need to know which technology it is, which version, which host, which geo this pod, node, or service is running in. So the context is more than important, because it gives you extra information and it helps you troubleshoot.
So the natural reaction, if I want to do observability, is to go to the CNCF. And if I go to the CNCF, I get a beautiful landscape matrix about observability. I see a lot of commercial vendor solutions and lots of open source. But the reality is different: in fact, there are tons of projects for any kind of solution, so it's more difficult to choose. So let me walk you through the solutions that you should know about.
So first, if you look at the CNCF, there are solutions like Jaeger, Grafana Tempo, Prometheus, and Fluentd. What we can see here is that we have solutions, but they are pretty much dedicated: we have specific solutions for tracing, like Jaeger and Tempo, and specific solutions for profiling. So, that's the first step. Then, to build traces, you attach to your OpenTelemetry API a trace provider and a propagator. The trace provider has several configurations: the span processor, the sampler, the exporter, and the resource. The exporter is basically where you want to export the telemetry data that you're producing. Once you have defined that, then, no problem, I can build spans.
So, what is a resource? The resource is the identity of your service: the service name, the namespace, the version, whatever; every piece of information that makes up the identity of the service. In fact, most observability solutions on the market rely on the resource to do the right naming in the UI of those backends.
Span processor, what is it? In OpenTelemetry, there are two types of span processor, and the span processor determines how telemetry data is communicated back to the exporter. Let's take an example: the simple span processor. The simple span processor is designed for live data, which means every time a span ends, it is sent to the observability backend. Great, but you can imagine that it produces lots of network communication, so it could be very expensive in terms of network. That's why there's the other one, called the batch span processor. The batch span processor reduces the communications, so it's more designed for production. You have less live data, but the idea is that every time you have a span, it goes into the batch, and then the batch of telemetry data, say a thousand items, is sent to the observability solution in one go.
The second aspect that you need to configure is the sampling decision. You are the one who configures how much data you want to send to the observability solution. OpenTelemetry provides different sampling decisions. We have Always-On: 100%, whatever is produced is sent, which could be very expensive in the end. You have Always-Off: nothing. Parent-Based means: here, I'm service B, and I will only send traces or spans if I'm involved in a sampled transaction; it's very useful when you have a dependency, for example. Then you have the Trace-ID-ratio-based sampler, where you determine a percentage: "Oh, I need 20% of traces here, I need 10% here, 5% here." You set a ratio, and this determines how much information you send over to your observability solution. This is crucial, because, in the end, depending on the sampling decision, you will have less information to use in production.
So, if you have 1000 requests in this example, with 50% sampling on A, 25% on B, and so on down the chain, the consequence is that if I want a full end-to-end trace from A to service E, I may have only one request out of those 1000 that is fully instrumented. So again, sampling decisions are great, but you need to configure them properly.
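A back-of-the-envelope check of that effect, with hypothetical per-service ratios (only the first two match the talk; the rest are made up to illustrate how quickly independent sampling decisions compound):

```python
# Hypothetical head-sampling ratios per service along the chain A -> E.
ratios = [0.50, 0.25, 0.10, 0.08]

requests = 1000
fully_traced = float(requests)
for ratio in ratios:
    fully_traced *= ratio

# With independent decisions at every hop, only about one request
# in a thousand survives sampling end to end.
print(round(fully_traced))
```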
Then you have trace propagation. What is it? Trace propagation determines how the information needed to attach traces together is passed along. If I'm the first service, the frontend, I start the trace. From the moment I start in service A, I have the context, so I can attach all the sub-transactions to it. But how can I attach the rest of the transaction in service B? This is what we call propagation. Before I send the actual call to service B, I inject the context. Then the first thing service B does is extract the context, and from there all its steps are attached to the same trace.
So why am I saying this? Because in trace propagation, there are several formats. Why? Because before OpenTelemetry, we had OpenTracing. So we have the ones from OpenTelemetry, W3C Trace Context and W3C Baggage, and then you have B3 and B3 Multi, which come from the OpenTracing and Zipkin era. The problem is that if two services don't use the same format, they don't understand each other, so you lose details. So make sure to use the same trace propagation format everywhere.
On the metrics side, it's very, very exciting, because now that it's vendor-agnostic, I can build my own metrics right out of my code. I can measure the number of products that customers have added to their carts, because I control that metric. So how can I produce my custom metrics? Very easily.
I add the OpenTelemetry API and set up a meter provider. There is a metric reader that basically collects the metrics and sends them to an exporter. Once I have defined this, I can simply create my own metrics.
Similar to Prometheus, the OpenTelemetry project covers the same types of metrics. You have counters. As a reminder, counters are metrics that only go up, so they're perfect for, say, the number of HTTP requests coming into your service; from that you can derive HTTP requests per second, for example. What OpenTelemetry has introduced is the up-down counter: you can add positive values but also negative values, so it can go up and down.
And you have gauges. Gauges, you set a value. So it's perfect for CPU usage, for response times, for memory heap, and so on. And then you have histograms. Histograms are perfect for, let's say, analytics. I want to have a P90 out of my response times, and then for that I will use histograms.
So, the collector. That's the main component I mentioned before. The collector is very important because you're probably going to use it heavily in your environments when you deploy. You can deploy the collector in various ways. You have the agent mode, which is the most recommended way, because in the end it's close to the code, so there are only localhost communications and you don't consume network bandwidth. In Kubernetes, you can deploy it as a DaemonSet, so every node has its own collector.
And the last one, which is not recommended unless you have no other option, is the gateway: one big collector that receives everything. It would be very hard to maintain over time. So what happens today is that you don't rely on one collector; you rely on several collectors that each do part of the transformations. Here, in red, I have a sidecar container in my workload, and those are collectors. My app produces telemetry data and sends it locally to the container sitting next to it, so I can do some minor transformations, and then I send it to another collector, the blue one here, that does some extra transformation.
And then, what is beautiful is that I don't touch my code with any vendor library. If I need to change, say today I'm using Grafana and tomorrow I want to use Dynatrace, the only thing I change is one collector's settings. So it's very easy and very convenient. A collector is similar to log agents like Fluentd or Filebeat, if you have been playing with those. You need a pipeline. The collector has a pipeline that needs to be designed, where you define what you're going to receive (metrics, traces, and so on) and from which sources, how you're going to transform it with processors, and last, where you're going to export that information, to which observability solutions.
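A minimal collector pipeline of that shape might look like the YAML below; the backend endpoint is a placeholder, and receivers, processors, and exporters are wired together in the `service` section:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:            # buffer and batch telemetry before exporting

exporters:
  otlphttp:
    endpoint: https://backend.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Switching backends then means editing only the `exporters` section; the application code never changes.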
So now you have most of the background, and you will see that we use this telemetry data for production usage but also for automation. That's why you need to make sure that the quality of the measurements you produce is reliable and efficient when you need to troubleshoot. If I build metrics, I need to make sure I have the right dimensions, the right labels, so I can split by pod, by service, by version number, whatever you want. You need to make sure you have the right dimensions, otherwise the data won't be usable.
On the traces side, same thing. If I have a huge transaction, maybe it makes sense to create span events. Then you have a set of sub-tasks, you know where you're spending time, and you have more details. On the logs, same thing: you need to add dimensions so you can index the information and cross-correlate it between metrics, traces, and logs.
But at the end, keep in mind that your app will be destroyed. There will be a murder, someone will kill your application. And you will need that telemetry data to be able to profile and figure out who is responsible for that outage. And you need to have the right telemetry data for that.
Another example: the second phase, the observability challenge, is that at the moment we have two types of personas involved. We have the producer on one hand, and on the other hand you have the consumer. And the problem is that they have two different objectives. The producer: usually I'm building my code and I want to debug. When I'm debugging, I probably want live data, so I will use simple span processors and Always-On. Then I have as much detail as possible.
The problem is that I'm working on my own microservice, so in the end I lose the broader scope of the application. Maybe I'm not standardizing the labels and dimensions I'm producing, so the data won't be as useful. For example, here I'm building a trace and suddenly I see, 'Oh, I'm spending a lot of time on that blue line.' Why is that? Is it really my application, or is it the fact that I didn't build the right instrumentation? So what I did: I reviewed my OpenTelemetry spans, the way I produce them, and I transformed them into this. At least now I have more details. I'm no longer looking at one opaque block of time in that service; I have more steps and a more fine-grained view of what's going on. So, basically, when you produce traces, you have to look at what you actually get out of the telemetry data produced.
Now, you have the consumer. The consumer, on the other hand, is going to be global. First, they're going to watch production, so they want enough data, but at the right cost. Too much information is great, but I will pay for that information, so I need to figure out the right balance. The second thing is, because I'm consuming the data, I will probably build dashboards for the whole organization, so I want to be able to build one dashboard for everyone, to be generic. That's why I need to make sure that everyone is using the same standards in terms of naming, labeling, and so on. The idea is that I set the guidelines, but I'm not the one adjusting the instrumentation. So this is a challenge, but again, with the right communication between teams, it can be easily resolved.
So now the question is: how can we continuously improve the sound of our code, without spending too much time and effort building those traces and metrics? Well, there's great news: if you're using an Ingress in Kubernetes, or if you're using a service mesh, most of them support distributed tracing. That means, out of the box, if I enable that, I will have all the communications between my services; at least I'll have service A, service B, service E, and that level of detail. Then, on each service, I simply need to enable auto-instrumentation. So I first have the network layers, and then, inside those boxes, I have the details of what's going on in the services. And that is done without any change to my code. Then I look at the produced signals, and if I don't have the right information, that's where I add manual instrumentation.
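For a Python service, for example, the zero-code auto-instrumentation step might look like this (the package names are from the official OpenTelemetry Python distro; `app.py` and the service name are placeholders):

```shell
# Install the auto-instrumentation entry point and detect installed libraries
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app unchanged; supported frameworks get instrumented automatically
opentelemetry-instrument \
  --traces_exporter otlp \
  --service_name cart-service \
  python app.py
```

From there, the workflow the talk describes is to inspect the spans this produces and only add manual instrumentation where detail is missing.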
So, takeaways. First, make sure that your code is agnostic. Don't rely on vendor libraries, because if you change solutions in the future, it will be difficult to make the change; you'd need to remove that library from the code. OpenTelemetry is there to help you go in that direction. Second, on the observability layer, make sure to have metrics, logs, and traces with the right context. The more information and context attached to your telemetry data, the more efficient you will be when you need that information to troubleshoot or to create alerting rules. And then, on how to produce telemetry data: first, adjust the settings, the sampling decisions and the span processors, to get what you need in terms of data without spending too much on storage. Use automatic instrumentation; most frameworks today produce lots of good details. From there, figure out if the details produced are good enough, and if not, figure out if metrics could help you as well. So, keep looking at it; don't take what you produce as a final version. You can improve it, because in the end, if you have a production outage and you realize your telemetry isn't helping you troubleshoot, that's when you ask yourself whether you have to change things.
So, a few last sentences: check out my YouTube channel. There is a lot of content on OpenTelemetry, on Linkerd, and more, and the idea is that it's out there to help. And, of course, I need feedback to improve my content. So check it out and send me your feedback; that would be very helpful. Thank you. And if you have any questions, I will be delighted to answer them.
Thank you. How light is this, uh, framework altogether in terms of memory utilization, network consumption, and CPU? Good question. So today the instrumentation library does not produce too much overhead. If you use the auto-instrumentation, the overhead is very small. Even vendor libraries today are around one to five percent of overhead, and OpenTelemetry is going to be in the same range. Again, like I said, it really depends on the quantity of data you produce. If you use Always-On and you send hundreds of transactions to your observability solution, you will consume more network, and then you will probably rely on a collector. And the collector will be your main point of concern, because if your collector goes down, then you have nothing coming into your observability solution. So the idea is that you first need to reduce, and then you have to fine-tune the settings. By the way, I didn't mention it, but there's a project coming along called OpAMP. With that, all the configuration I talked about will be done automatically in the future. So it's moving to a better landscape. Be patient, and in the coming months it will be even easier to onboard onto OpenTelemetry.
Is there any other questions?
Um, hello. Yeah, I... I just... I just got here and I don't really, um... I'm new to this. So is there an analogy that you can make for me to understand? The idea is that today, if you are running some code and you want to understand why it is slow, OpenTelemetry will help you do that. Today we have observability solutions that require you to add vendor libraries to your code to build your custom telemetry data. OpenTelemetry is an open-source project. It's agnostic, which means you put the library in your code, and with that library you're not relying on any vendor. You can produce your custom metrics, your custom logs, your custom traces without any vendor. And then, of course, you need a vendor to display the data and make it beautiful. Here we're producing metrics and data; we're not in the business of plotting it and making beautiful dashboards. Keep in mind that you are the one in charge of the data, and then you need a solution that makes something beautiful out of it. Um, I see. So it's like a debugging tool. It's more than debugging. It's for production. If you automate, if you do testing, if you want a quality gate that collects information from the code and says, 'I'm good to go on my testing processes' based on these and those metrics, you can do that by relying on standards. So it's not just about debugging; it's about production: monitoring, looking at what's happening in your environment, automating your tasks. There are plenty of use cases today with observability. I see. So you, um... you basically see what happens under the hood, right? Yeah. You probably know profilers? If you code, you know profilers. No... No, no, I'm still new to all this. So a profiler is something that sits in your code and basically gives you all the instructions of everything that happens in your code, down to the kernel instructions.
So if something is slow, you see directly: 'Okay, I'm slow here because I'm spending, uh, 80% of my CPU time on the getProduct function', for example. So I know what I need to optimize. I don't have to search in my code anymore. I just see, from a UI perspective: okay, if I need to be faster and better for the customers, this is what I need to optimize. I see, thank you. Thank you so much.
Okay, sorry. Yeah, question. Is there anything that you can say to, um, make me choose OpenTelemetry versus Jaeger, for instance? They seem to be really similar. Jaeger, for me, is a tracing visualization tool. Jaeger is not helping you to instrument. OpenTelemetry is the library that you put in the code, which produces the spans and the traces and the metrics, and then you need to send them somewhere. So if I want, I can send it to Dynatrace, I can send it to Grafana, and I can send it to Jaeger. I need something that visually helps me see what I've produced. So I would say that Jaeger is the UI, and OpenTelemetry is the technology that helps you produce the stuff. Yeah, question. What languages does it support, like Java? So today, in terms of support, there is Java, C++, Node.js. JavaScript is actually split into two projects: one at the browser level and one at the app level. There is .NET, there is Erlang, there is Go, there is Rust. There are a dozen or more languages supported, and there is also a project for the CLI. So if you write batch scripts and you want to instrument them, to say, 'I want to have the steps of my batch script', you also have a library to do that. Great, thank you. One last thing, um, this is maybe a dumb question, but from a BA perspective, if my developers are using uDeploy, is this a similar tool, or does it integrate, or not at all? uDeploy will just deploy stuff; it won't give you this kind of information.
What you want, for example: I am a website and I want to keep track of the number of products in each category, and I want to build a metric for that, and there is no existing exporter on the market that will help me build that custom metric. Now, if I'm a coder, I can go to the cart service team and say, 'Hey, can you build that metric?' And now you can do that, which means you can have a business KPI that you can follow over time, for example the number of products that people are adding to the cart. You can do that at a very code-level perspective. So keep in mind that OpenTelemetry is about producing metrics and measurements, and then you need a solution that will visualize them.
Any other questions?
Right, thank you.
Thank you.