Architecting a Data Infrastructure with Kubernetes
Speaker: George Trujillo
Summary
In this talk, George Trujillo, an experienced enterprise architect, explores architecting data infrastructure with Kubernetes. Learn about real-time AI, data strategies, machine learning, and the crucial role Kubernetes plays in deploying models efficiently. Hear firsthand stories of his journey to stateful solutions and managing business requirements. Discover insights on technical debt, strategy drift, and how to improve the speed of organizational execution.
Transcription
So, hey everybody. Thank you for joining me today. My name is George Trujillo and I'm going to be talking to you all about architecting a data infrastructure with Kubernetes. For basically about the last 10 years, I've been building out real-time architectures for companies as a master principal Enterprise Architect and as an executive leading the data initiatives. So, what I'd kind of like to do is really just share with you some of the things I've learned in that process and I think my real goal for this is to get you to look at data in a different way. That I think will help you be more successful when you're executing your data strategies. And the way I'm going to do this is talk about how real-time AI is driving data to Kubernetes.
The reason I'm doing that is because real-time AI is happening significantly faster than the type of AI that you do in your data warehouses and your analytical platforms. So, the speed of change is critical in real-time AI and you have to get faster. Most companies are not doing it fast enough. I'm then going to talk about how Kubernetes should be driving your data execution and give you some examples of that. And then, I'm going to show you some examples of stateful data solutions.
First of all, I want to set some context. When I talk about real-time, I'm not talking about three microseconds or milliseconds. Real-time is essentially a decision that has to be made in a certain time window, and depending upon your use case, those time windows can be different. Second, when I look at operational data: a long time ago, we set up these things called operational data stores, and their purpose was basically to integrate data from multiple data sources. If you want to be able to make real-time decisions, you have to correlate and aggregate data from a lot of different sources. So, it's not about having database A, database B, and database C. Whoever's going to win this race integrates data faster and better than their competition.
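Since "real-time" here just means a decision landing inside a use-case-specific window, that definition can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical names, not code from the talk:

```python
import time
from dataclasses import dataclass

@dataclass
class Decision:
    value: str
    requested_at: float  # when the triggering event arrived (epoch seconds)
    made_at: float       # when the decision was produced

def within_window(d: Decision, window_seconds: float) -> bool:
    """A decision counts as 'real-time' only if it lands inside the window
    the use case allows: fraud scoring may get 200 ms, a coupon offer 30 s."""
    return (d.made_at - d.requested_at) <= window_seconds

now = time.time()
d = Decision("approve", requested_at=now, made_at=now + 0.15)
print(within_window(d, 0.2))   # fast enough for a 200 ms window
print(within_window(d, 0.05))  # too slow for a 50 ms window
```

The point is that the window, not the absolute latency, defines "real-time" for a given use case.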
And machine learning: when I talk about AI, I'm talking within the context of machine learning, in terms of it being a subcategory of AI that uses models. And what those models really are is algorithms, or programs, that make decisions. So, when you think about what our strategy is for working with machine learning, you need to take a step back, look at it from a little higher perspective, and understand that these are programs that we have to deploy, and we have to deploy them accurately and with speed. How can we get better at doing that?
And then, I'm also going to be talking about machine learning data. And this is features. And features of the data that machine learning algorithms use to make decisions. Now, features are used in two different ways. First of all, they're used to train the models but then at the same time, when you put a model in production, it has to look at data in real time to make those decisions after it's been trained. And so, when I refer to that, that's inference data. So, machine learning algorithms need high-quality, valuable data in development but it could also be different data that they have in production that they have to work with.
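One way to keep training features and production inference data cohesive, as described above, is to derive both through a single feature function. A hedged sketch with made-up field names, not code from the talk:

```python
def order_features(order: dict) -> dict:
    """One feature function, used offline to build training rows and online
    at inference time, so the model never sees skewed inputs."""
    return {
        "order_total": round(order["quantity"] * order["unit_price"], 2),
        "is_large_order": order["quantity"] >= 10,
    }

# Offline: historical orders become training rows.
history = [
    {"quantity": 2, "unit_price": 22.22},
    {"quantity": 30, "unit_price": 0.22},
]
training_rows = [order_features(o) for o in history]

# Online: the deployed model applies the exact same function to a live event.
live_event = {"quantity": 1, "unit_price": 22.22}
inference_row = order_features(live_event)
print(inference_row)
```

Sharing one function between the two paths is one common defense against training/serving skew.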
So, real-time AI. It's exploding, and I see this every single day. I was talking to a VP of IT who told me he was building out a smart airport. And I go, 'What are you trying to do?' He goes, 'When you look outside the terminal and you look at the planes, look at all the moving parts, look at all the equipment. What we want to do is put sensors out there to reduce our operational costs and increase our efficiency.' So, does it kind of make sense that if you wanted to do that for an airport, you'd have to put sensors out there? Now, think about that. If they put 100 sensors out there to help them, what's the probability that they're going to grow to 500 sensors, to a thousand sensors, to 10,000 sensors?
So, when you look at anybody that's building out intelligent devices in IoT, they understand it's exploding. And how are we going to manage all of that data? I'm seeing companies that are innovating as they're growing. I know an IoT company that started out just monitoring heat in buildings to reduce costs for companies. Then they go, 'Wait, if we can monitor that, can't we do lighting, can't we look at cooling, can't we also do security, can't we do RFID badges?' So, their company's evolving with every new device that they're putting out there.
So, we need to understand that all those sensors, all those intelligent devices, all the wearables, they have to send data somewhere, and they're going to do that in real-time streams. So, if you think about that, everybody's talking about the growth of the cloud. Now, that airport, for example, when they're streaming all that data: what's the probability that those hundreds or thousands of devices are going to stream that data straight to the cloud? Or, what's the probability that they're going to stream it locally, and then that local layer is going to separate the signal from the noise and only send certain data to the cloud? So, whenever you hear people talking about edge computing, you can see from just these few examples that it's going to explode. And edge computing, for the most part, is a hybrid strategy. You're going to want to send all those streams locally and then determine what you're going to send to the cloud.
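That hybrid edge pattern, filter locally and forward only the signal, can be sketched like this (the threshold and sensor values are invented for illustration):

```python
def separate_signal(readings: list, threshold: float) -> list:
    """Keep only readings that deviate from the local baseline by more than
    `threshold`; everything else stays at the edge and is never streamed
    to the cloud."""
    baseline = sum(readings) / len(readings)
    return [r for r in readings if abs(r - baseline) > threshold]

# A window of temperature readings from one sensor, with one spike.
sensor_window = [21.0, 21.1, 20.9, 35.4, 21.0]
to_cloud = separate_signal(sensor_window, threshold=5.0)
print(to_cloud)  # only the anomalous reading is forwarded
```

Even a filter this naive shows why edge-local processing shrinks the stream that ever reaches the cloud.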
So, if we're going to do all this, what are our requirements? What conditions do we have to meet? So, you can see, scalability is going to be a key, right? And it's not just that we need to scale and grow at volume, we need to scale up and down dynamically. How are we going to address data quality? How are we going to address the speed of innovation, the speed of change management, the speed of data integration, the speed of deployment of machine learning models?
Because, think about it, when you make decisions in real time, you can't take them back. If you decide, based upon a condition, to send out a coupon to 20 million customers, you can't undo that, right? Or what if you decide to turn off a valve in a power plant, or turn one on? There could be ramifications that you can't undo. Or what if you determine whether or not to send a medical alert? You can't undo that either, right?
So everything that you're doing with your data today, you have to do much faster if you're going to be competitive. And if you're not faster, with quality, I doubt you're going to succeed. What I'm seeing, when I work with customers, is that about 10 or 15 percent of the companies I work with, and I'm talking blue-chip companies, are doing this really, really well. And there's a bigger and bigger gap of companies that aren't doing it well, because most organizations haven't had to work at that level of speed and that level of quality.
So, how are we going to address this? All the sessions that you've gone to today have talked about Kubernetes: how it allows applications and microservices to scale up and down dynamically, correct? How to do resource management, how to do load balancing for integration. It's self-healing: if you're running Kubernetes and an application fails, it can heal itself, correct? So, is everybody pretty clear, in terms of your background or the presentations that you're seeing, on the importance of Kubernetes in implementing microservices?
Because if you're going to make changes, not once a quarter, but once a week, or once a day, or multiple times a day, you have to work with speed. And you can't do it consistently unless you have some tool like Kubernetes, right? So that's square one. That's table stakes.
So, data has to work well together. I love hearing this joke, but I also hate hearing this joke, and you've all heard it: data scientists spend 80 percent of their time wrangling with the data, and only 20 percent of the time doing data science. If you think about that, that's a data problem. That's a data modeling problem. That's a data integration problem. That's a data architecture problem that's not being solved. So how do we take care of the big rocks first?
You're going to have to deal with streaming data. What is our strategy for data ingestion? And if it's going to potentially grow exponentially, how do we handle that growth, with speed? How do we absorb it?
Operational data: this is all the data that you're integrating from different sources to be able to make decisions. You cannot make decisions without getting a lot of different data together. Now, if you're doing things in real time, what's the probability that that data, and the characteristics that you need to make a decision, are going to change?
The characteristics are going to change constantly, correct? So, if your models are making decisions, what if you have to make a model change quickly? Can you do that? I know companies that spend weeks and months coming out with a new model, or making updates to it. What about when you have to do that faster? And the third is all of the data that your models use: all the features that you use to train those models, and the inference data that you have in production. All three of these types of data have to come together, and they have to be cohesive. So, it's not just 'can you build this?' It's how rapidly you can evolve it and change it to keep up with the speed that your business needs to run at.
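Here is a toy sketch of those three kinds of data meeting at decision time: a streaming event, operational context looked up by key, and the resulting feature row handed to a model. All names and values are hypothetical:

```python
# A stand-in for an operational data store keyed by customer.
operational_store = {
    "cust-42": {"lifetime_orders": 118, "segment": "gold"},
}

def decision_row(event: dict) -> dict:
    """Join a streaming event with operational context to produce the
    inference features a deployed model would score."""
    context = operational_store.get(event["customer_id"], {})
    return {
        "amount": event["amount"],
        "segment": context.get("segment", "unknown"),
        "is_repeat": context.get("lifetime_orders", 0) > 1,
    }

stream_event = {"customer_id": "cust-42", "amount": 22.22}
print(decision_row(stream_event))
```

If the stream, the operational store, and the feature logic evolve on separate timelines, this join is exactly where the breakage shows up.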
So, we're going to talk about data execution with real time. So, here's one of the big problems. Every company I go to, they've got a VP of software engineering or software development, they have a VP that manages their databases and manages their data SRE teams, they've got a VP that does data warehousing, correct?
And we all talk about the issue with silos, we all live and breathe it every single day. Here's the problem with that mentality that we've all had. Data doesn't work in silos, especially not real-time data. And everybody's going to come to that realization, sooner or later, when you can no longer keep up with the speed. Because, think about it, what does data do? It comes from these sources, and it gets ingested by some type of ingestion platform, correct? And then that data flows into an operational data store, or databases for persistence. Then that data flows into data warehouses. So, you may have your silos, but the data is flowing through that ecosystem. And then, for real time, the data flows back. Because for your models to make decisions, that inference data, it's in memory caches, it's in dashboards. So, as you're retraining those models, the data's got... some of that data has to flow back, correct?
And here is how we have to start changing. We have to look at a data ecosystem. How it lives, how the data breathes, and we have to start building our strategy, for thinking of data as a flow, of a data supply chain, that flows back and forth.
So, all of you today have been going to a lot of Kubernetes presentations, and you're seeing the importance of using microservices for applications. So, for every reason that you would want to use Kubernetes for applications: if machine learning models are really programs that have to be deployed, and have to be deployed consistently and properly, wouldn't it make sense to use Kubernetes for that? Some of the top data scientists that I've worked with in my past, I call them periodically and ask, "How often are you using Kubernetes?" And they'll say, "Extensively. We could not deploy models without using Kubernetes."
Well, I did not come up with this idea like, "Oh, this is a really cool idea." I've got the scars on my back from figuring this out to solve problems. So, one of the problems I had is, if you're building out high-growth apps, and those apps are generating real-time data, it has to go into data streams, correct? And if you're using microservices to deploy those high-growth apps, don't your data streams have to work with the same speed and agility?
Okay, those streams go into databases. Don't your databases need to work at the same speed? I'll give you an example. I was working at a company, and the DBAs deployed a new database, and in their Linux configuration, they had a configuration issue. It created a massive cascade effect. It almost brought their entire online site down, and they basically said, "This can never, ever happen again, and you have to prove to us that you are solving the problem so it won't happen again."
Well, we had gone to microservices, we'd gone to Kubernetes for machine learning, we went to Kubernetes for our data streams. "Wait, we want to be able to deploy consistently, with speed, to be reliable. Shouldn't we be using Kubernetes for our data?" And here's what forced us: the company had a hybrid strategy, and they wanted to go to the cloud, use the cloud for disaster recovery, and then, over a period of time, move everything to the cloud.
So, they wanted to move their apps, but they couldn't, because apps that were dependent upon streams that were dependent upon databases that were on premises, those all had to be able to move to the cloud as well, correct? Now, if you're doing more edge computing, you might be building your models in the cloud, but you might be deploying those models on the edge, on premises, or in different clouds. So, all of this has to work together. And your databases: if you believe you should be using Kubernetes for those other data environments, and you want a cohesive ecosystem that works well together, shouldn't they all be working together?
And this is really, really hard, and it's not only hard from a technology perspective, because you all know Kubernetes is difficult and it's hard to get enough talent to do it well. So how do you get an organization to execute Kubernetes as an enterprise strategy? Because the challenge that I had is, I had Kubernetes teams that were in software development environments, that were in data science environments, that were in my data stream environment, and then I wanted to add the database, and they were all doing things differently. They were all using different clusters, they all had different skill sets, nobody was learning from each other.
Think about that. I want this data ecosystem to work cohesively together. How can I do that by having completely separate teams? Now, this is a fundamental problem that we've never solved. We've all seen this your whole career, right? We got to have very centralized systems. "Oh, that's not working, we need to go to decentralized systems." Three years later, "Oh, that's not working, we got to go back to centralized systems." You can see that this is a problem that we have to solve.
But I would challenge you: if your company is going to be doing a lot of real-time AI, and you need to be updating those models and their data on a regular basis, how do you work consistently at scale? Now, Joan of Arc had a lot of good ideas, and they not only burned her at the stake, they danced while they burned her. So, to get your companies to accept this, I mean, we can't even imagine the political challenge that you're going to have with that. I don't believe companies are going to solve this because they go, "Wow, this is a smart way of doing it." They're going to solve it when the pain threshold is so high that they're forced to.
So, what was my journey to stateful solutions? I had been implementing real-time AI for about 10 years, and I was using a certain data ingestion platform. It's extremely popular, and I implemented it at two financial services organizations, and it was very successful. So, I went to a new company, and I said, "Why wouldn't I just use the same thing? It's the most popular software in that space." But when I started this new role, I went and talked to all the executives and all the lines-of-business executives and said, "What are your challenges? What are your issues? What's not working? What would you like to be able to do faster?"
And I collected all that information, and I just looked at the problem, and I realized, no matter what question I asked them, the problem was they could not get the data teams to work at the speed that they needed, and there were constant mistakes. And those constant mistakes are seen by your customers and your partners before you see them. Here's a simple example: the company went to change the price of something to 22.22, and it got changed to 22 cents. And somebody at a weekly meeting was saying, "Why are people now buying this product in units of like 20 and 30? They used to buy it in like one or two." And they were trying to figure out what was going on with the data: "Why has this changed but not that?" And somebody finally realized, "Oh, it's now 22 cents." It took a week to figure that out.
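A guardrail as simple as a ratio check on price updates might have caught that 22.22-to-22-cents change before customers saw it. A hedged sketch; the threshold is an arbitrary assumption, not a recommendation from the talk:

```python
def suspicious_price_change(old: float, new: float, max_ratio: float = 5.0) -> bool:
    """Flag any price update that moves more than `max_ratio` times in
    either direction -- e.g. 22.22 becoming 0.22 is a ~100x drop."""
    lo, hi = sorted((old, new))
    return hi / max(lo, 1e-9) > max_ratio

print(suspicious_price_change(22.22, 0.22))   # the 100x drop from the story
print(suspicious_price_change(22.22, 24.00))  # an ordinary update
```

The check does not replace data governance; it just shows how cheap it is to turn a week-long mystery into an instant alert.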
So, how do you work in real time if those type of things are occurring on a regular basis? So here, I'm talking about data, I'm not even talking about data governance, model governance, data lineage, model lineage. All those things have to be in place. And how many of your organizations are doing a really good job with that? And you now have to work 10 times faster. Something has to change, correct?
So, what was interesting for me is, I had used a certain ingestion platform, and I realized it wasn't going to meet my business requirements. And I go, "How could that be? I've used it twice in big companies. It's the most popular software in that space." And then it hit me: the business requirements for which I had chosen that software in the past did not match the business requirements that I had today. Because things had gotten faster; the requirements had evolved. And so, I went with a product called Apache Pulsar.
Our company has two different offerings. It has an on-premise and a cloud offering. And so, I'm not telling you to use our software. What I am telling you is, when you're looking at making decisions on your technology stack, don't go with what's the most popular. Don't go with what has the market share. Take a look at the criteria and make sure that you really understand it. And make sure it's going to meet the conditions that you need for that type of software.
I have worked at two big database vendors in my past, so obviously, those databases were my preference, correct? And I realized, through the journey that I went through, that my database had to be able to work cohesively with my streaming platform. I had to be able to move from on-premises to the cloud, and across clouds. So, what database could I get to work well with that ingestion platform?
My recommendation to you is, if you're looking for a new database, don't go first to the developers, the data scientists, the DBAs. Understand the business requirements, because I guarantee you, most companies are choosing a database based upon somebody's preference, or what they're experienced with, not really understanding what the future business requirements are going to be for that type of platform. So, if you notice, once again, I'm saying: build a cohesive system for real time.
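Weighing candidates against business requirements instead of popularity can even be made mechanical. A toy scoring sketch, with entirely invented criteria, weights, and scores:

```python
def requirement_score(candidate: dict, weights: dict) -> float:
    """Sum of per-criterion scores (0.0-1.0) weighted by how much the
    business cares about each requirement."""
    return sum(candidate[c] * w for c, w in weights.items())

# Hypothetical business requirements and their relative importance.
weights = {"elastic_scale": 0.3, "multi_cloud": 0.3, "k8s_native": 0.2, "team_skills": 0.2}

candidates = {
    "incumbent_db":    {"elastic_scale": 0.4, "multi_cloud": 0.2, "k8s_native": 0.3, "team_skills": 0.9},
    "cloud_native_db": {"elastic_scale": 0.9, "multi_cloud": 0.9, "k8s_native": 0.9, "team_skills": 0.5},
}

best = max(candidates, key=lambda n: requirement_score(candidates[n], weights))
print(best)  # the familiar choice loses once requirements are weighted
```

The familiar database scores highest on team skills alone; once future requirements carry the weight, the ranking flips.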
I'm not saying that you change anything that you have today. But why is it acceptable for organizations to keep absorbing more and more technical debt from decisions that no longer apply today? What I am saying is, as you build out your technology stacks, sooner or later you have to draw a line in the sand and say, we're not going to keep absorbing more technical debt. We may not be able to have the resources, the funding to replace what we have. But let's make sure moving forward, we're making the right decisions.
So, I would be more than glad, at any time, or my company, DataStax, we're going to be at the booth, to just talk to you. Not about our software, but about the characteristics and challenges, and some of the things that I went through to move organizations past that. Because that type of cohesiveness is really hard to achieve, and you'll never achieve it by mandate. I've been an acting CDO where all data reported to my office. I did not achieve it by mandating it and saying, "Hey, here is what you are going to do."
I had to create a vision that the business and IT would align to and, more importantly, commit to. And that was massively hard. I had to constantly go out and meet with the technology teams, and I had to constantly go out and meet with the business teams, to make sure that they stayed aligned with that. Because even if you have a strategy and a vision that everybody agrees to, here's the problem: you'll have strategy drift. You're all familiar with data drift and model drift. You also have strategy drift.
And what strategy drift is, is everybody says, "Hey, here's what we're going to do in the future. Here's our approach. Here's what we believe in. We're going to be data-driven." But then the downstream teams say, "We don't have the resources. We don't have the people. It's so much easier for us to deploy this database that we know really well. Or not use Kubernetes. Or not use whatever, because we don't have the resources." Project one goes that way. Project two goes this way. And all of a sudden, you realize: we have the strategy, we have this vision, but the downstream teams aren't executing to it. There's always a reason why they can't.
That's why you see a company go, "We're making all these changes," and a year later, the executives are going, "What's our ROI? Nothing's changed." Right? Can you see that happening over and over and over again?
So, where we have an opportunity as practitioners is, how can we play a role in helping our organization get faster at execution? And it's not going to happen because we're going to hire more talented people. And it's not going to happen because we have more super cool software. It's going to happen that we change our mindsets on how we go about solving these problems.
I'll give you an example: data-driven. Never in the history of our careers has data been more important. Everybody talks about being data-driven, correct? Yet I'd say 80 to 90 percent of the companies that say they're data-driven are not. And I would recommend that you go out and look at a survey that comes out once a year. It's called the NewVantage Partners Data and Analytics Survey. They interview around 110 blue-chip CIOs and get their feedback.
Blue-chip companies, the CIOs: only 19 percent in the survey that came out in January said that they were data-driven. Only 19 percent. Only 25 percent said that they had built a data-driven culture. These are top blue-chip companies. When they do an honest assessment and say, "We're not accomplishing that," what about the rest of us?
So, this mindset and this change and this new way of looking at a problem. I really recommend that you try to do that. And you can't control the rest of the world. You're not going to change your whole company. But maybe you can get more efficient in how you do things.
Maybe you can say, "Hey, we're implementing Kubernetes. And we're working with the application team. Let's have meetings with the Kubernetes team that's building out the machine learning models. Let's mingle. Let's see how we can cross-pollinate and make a difference in your group." Because I guarantee you, your company will see the results of that.
So, what I ask is that you look at the dimensions of your data. And here's how I look at the dimensions of the data: from the streaming perspective; from the operational, integrated data that I have to collect and process; and then my machine learning data.
How do I get all three of those to work well together? Because the one thing that you hear data scientists always talk about is data quality. Correct? You have to get this data to work together. And as you build out your technology stack, that technology stack has to align across those dimensions.
So, something I'd recommend is, you go out and look at this URL. And read some of the white papers and videos that are there. Just look at it, not from our product's perspective. But just a real-time AI perspective. And saying, "How can we get more efficient in our execution?"
And the most important thing is, if you're attending the booth crawl tonight, we're going to be giving away a LEGO set. So, come by the booth, say hi. We'll drink mimosas together, and we'll get you signed up for the LEGO set. So everybody, thank you. I appreciate your time, and I'll be here for the next few days, as well as my team. And, if we can help you in any way, let us know. Enjoy the rest of your conference. Thank you.
Are there any questions? Does anybody have questions?
What do I like about that, though? Did you read the title of the book? Yeah, 'Managing Cloud Native Data on Kubernetes.' So, as you know, thinking about data with Kubernetes is not natural, and you're not going to be able to jump in the deep end. Take baby steps. So that's a good book that I recommend you take a look at. Look at it from like a 20,000-foot view, at the perspective of how we can evolve our data in our day-to-day work. So anyway, thank you all.