GraphQL and Cloud-Native Architecture: Is There a Better Way?
Speaker: Jim Barton
Summary
Jim Barton discusses the benefits of integrating GraphQL with application networking to address complex application networking problems. Barton uses a fictional scenario involving "Wayne Telecom," a telecom services provider in Gotham City, to illustrate the challenges faced by businesses in managing data and network requests. The talk emphasizes the advantages of GraphQL, such as reducing over-fetching of data, improving developer efficiency, and simplifying the architecture by eliminating the need for dedicated GraphQL servers. The presenter also touches on the importance of declarative configuration and the potential of leveraging existing API gateways to serve GraphQL. The session concludes with a demonstration of how GraphQL can be implemented and accessed.
Transcription
Good afternoon everyone. First of all, I want to say, last day of a conference, almost 4 P.M., you are the hardcore. You're the hardcore audience member, so I want to thank you. I want to thank you for being here. So, in this session, we're going to talk about how you can unleash declarative configuration with GraphQL to solve some of your most challenging application networking problems. But, you know, like I said, it's four o'clock in the afternoon. So if I just throw up a deck of 42 slides and drone through each one, half of you will be asleep by 4:15. No? Okay, I need to be closer. Okay, that's...
So, to combat the possibility of you falling asleep, we're going to organize this session as a little bit of a role-playing session. So, first of all, let's talk about what your role is in all of this. For the next 40 minutes, you, the audience, are Wayne Telecom. You are a telecom services provider based in Gotham City. You are the largest provider in Gotham City and you require highly available, low latency connections for both your mobile and web clients. So together, we're going to embark on a journey that's going to allow you to spend less of your billions on cloud infrastructure and get your organization back to fighting crime in Gotham City. Trust me, it's going to be a good time.
I'm sorry, what's that?
If you want to be the villain instead, then you'll need to find another talk. I don't know. So, just... that's good. I like a lively audience. So, just a word about who I am. My name is Jim Barton. I am a field engineer here in the U.S. with solo.io. My career in the Enterprise Computing space spans over 30 years. At this point, prior to solo, I was an architect with Amazon, Red Hat, and Zappos.com before that. And just very briefly, who is solo? So, solo.io is a company born in the cloud that specializes in helping enterprises navigate the complexities of application networking in a cloud-native context. We do this via strategic leadership in a number of open-source projects, for example, the Istio service mesh, Envoy proxy, and of course, our topic today, GraphQL. And we offer an enterprise-grade service platform that is based on those projects. So, let's turn our attention back to your problems at Wayne Telecom.
You are hemorrhaging money. Your mobile experience is absolutely dreadful. You're spending way too much money trying to band-aid your problems, and your development and operation teams are both really unhappy. So, let's drill deeper into why these problems exist and explore some potential solutions. First of all, let's take a look at one of your front-end application REST APIs. This is a sample page from your billing application. This page loads all of the phone plans that the current user has active. And in order to get adequate information for the summary page, you have to make multiple REST API calls per item on the page. So, your backends are microservices that are owned across different application groups. And so, fetching data for this summary page requires making multiple calls to various microservices and then aggregating the responses in the front end.
Alright, so the problem with making these multiple REST API calls is illustrated on this diagram here. Because you have a worldwide business model, you often get traffic from across borders, and even from the other side of the world. So, request round trips can take really significant times that lead to slow user experience. That can be, of course, aggravated by poor mobile connections, and that sort of thing. So, that's no fun.
There are, of course, a number of approaches we could use to address pieces of this problem. One thing we could do is to use a backend-for-frontend pattern. We could create an additional backend service which exposes aggregate endpoints. So, basically, our mobile app would only have to make a single API call to our backend-for-frontend service, which would then aggregate the responses from all of the individual backend services. That's one optimization we could apply. We could also do things with, say, content delivery networks by caching some of the content. That will help us for some of our use cases, but not for requests that actually perform writes via things like HTTP POSTs and PUTs. So, we can make some incremental improvements with this, but it's still not an ideal solution for the way apps are built in large multi-tenant organizations like you have at Wayne Telecom.
So, we need to explore some other approaches so that we can free up more of our organization's resources to fight crime here in Gotham City. This is where we introduce GraphQL into the mix. We can solve problems like the request waterfalls, like large payloads, as well as increasing our developer efficiency using GraphQL. So, how does that work? Well, let's take a look at this diagram. If our frontend leverages GraphQL, then it only has to issue one query to a GraphQL server to get all of the data it needs. So, what you see on this slide is a single GraphQL query that returns all of the data, and only the data that the frontend requires in a single HTTP response. And that's going to allow us to transform the backend of our services to use GraphQL.
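As a sketch of the kind of query being described, a billing summary page might issue something like the following in a single round trip. The field and type names here are purely illustrative; they are not taken from the talk or from any real Wayne Telecom API.

```graphql
# Illustrative single query for the billing summary page: one round
# trip returns plan, cost, and usage data that previously required
# separate REST calls to several backend microservices.
query ActivePlansSummary {
  currentUser {
    name
    plans {
      id
      name
      monthlyCost
      dataUsage {
        usedGb
        limitGb
      }
    }
  }
}
```

The client receives exactly these fields and nothing else, which is what eliminates both the request waterfall and the over-fetching.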
Okay, and that offers us some significant benefits. So, number one, our queries now return exactly the data we need and nothing more. So, with GraphQL, we are no longer at the mercy of bloated service interfaces that return every piece of data that a potential client might want who's accessing this service. Instead, we can specify precisely the data we want, and that's exactly what we get back. Number two, GraphQL allows us to use a single query to retrieve data that lives in multiple backend resources. So, no longer are we required to make multiple service calls and then use our client to sort of splice the data together from all of these different service interfaces. We have a unified schema, we only need a single query, and then we allow the GraphQL server to manage the dispatching of the requests and aggregating of response data to match exactly what we're asking for in our query.
Okay, so the third benefit is that GraphQL offers first-class support for a typed schema. You can almost think, in a way, of a GraphQL schema as being sort of like an OpenAPI or Swagger specification for your GraphQL endpoint, which is a good thing. So, you can see that pretty clearly in this example. In this example, we have a GraphQL query that your application issues, that's on the left side of the screen, and there's a corresponding schema definition on the right side. So, if you watch the kind of bolded and underlined text as we advance through the schema, you'll see exactly how each part of the query corresponds to the GraphQL schema.
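To make that correspondence concrete, here is an illustrative schema fragment in GraphQL's schema definition language. Every field a query can request must be declared in a type like these; the names are examples, not the schema from the slide.

```graphql
# Illustrative typed schema: each field a client may query is
# declared here, giving clients and tools a machine-readable contract,
# much like an OpenAPI spec does for a REST endpoint.
type Query {
  currentUser: User
}

type User {
  name: String!
  plans: [Plan!]!
}

type Plan {
  id: ID!
  name: String!
  monthlyCost: Float!
}
```

A query asking for `currentUser { plans { name } }` is valid only because `plans` is declared on `User` and `name` on `Plan`; anything undeclared is rejected before it ever reaches a backend.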
Okay, so that's nice. And because of that type schema, it also opens up a new world of tools that your developer teams at Wayne Telecom will now have access to. For example, they can leverage tools like GraphQL Playground, which is shown on this slide. Frontend developers can basically explore schema interactively, they can build up their queries as they go, again interactively, and then more easily incorporate those into their application services.
So, let's assume we can wave our hands and make that happen. Where do we stand with Wayne Telecom now? Let's review what we've done so far. We have deployed a GraphQL server. Our application teams have added GraphQL schemas and resolvers to their services. Those schemas and resolvers have been added to a GraphQL gateway. The platform team is going to manage this new gateway. The good news is that our front-end teams are now much happier and more productive. The developer experience has improved tremendously, and we've removed a lot of the friction that existed between our front-end and back-end teams by using this back-end for front-end pattern to create an efficient GraphQL interface between these two sets of teams. Our mobile app is now much more performant thanks to these changes that we have implemented. But we still have some issues.
On the front end, we've delivered some massive improvements, but we still face some challenges on the back end of the equation. We operate in the telco industry, which means that we are highly regulated. Consequently, we need to implement a zero-trust architecture throughout our infrastructure. We need to worry about capabilities like using mTLS to secure the communication that's flowing among all of our services, not just between our new GraphQL server and the back-end service, but also between the back-end services themselves. This is honestly a pretty heavy lift for an enterprise at the scale of Wayne Telecom. So, we adopt service mesh technology, something like an Istio platform. That's going to allow us to externalize general platform features like mTLS, rate limiting, caching, and authentication away from the applications themselves and absorb that into our service mesh infrastructure.
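As one concrete example of absorbing a platform feature into the mesh, Istio can enforce mutual TLS across all services with a single declarative resource rather than per-application code. This is standard Istio configuration; the root-namespace placement shown here is the conventional way to make it mesh-wide.

```yaml
# Enforce mutual TLS for all workloads in the mesh. With STRICT mode,
# sidecars reject any plaintext traffic, so service-to-service
# encryption requires no changes to the applications themselves.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying to the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT
```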
After making those changes to adopt GraphQL and to adopt Istio as a service mesh, our revised Wayne Telecom infrastructure looks something like this diagram. What we're showing here is just a single Kubernetes cluster with Istio installed. We have an Istio Ingress Gateway where our traffic enters the mesh. From there, we forward requests onto our GraphQL server, which acts as our back end for front end. It's going to manage dispatching requests to the back-end services, aggregating responses, and then returning those to our service clients. And of course, all of the internal communication within our mesh is going to be encrypted using mTLS. So, let's take this as the current state of our enterprise infrastructure.
Now, what I want to do is zoom in on two of the components in this architecture: one, the Ingress Gateway Envoy proxy, and then also the GraphQL server. This is a very common configuration that we see with customers who are deploying GraphQL today. Application teams pick up a GraphQL framework, they write some code to resolve multiple upstream data sources into a GraphQL API, they integrate with a number of libraries in the process, and they produce an application deployment that is then managed by a platform team. The platform team owns the operational responsibility for the health and availability of this deployment. But the story doesn't end there. The GraphQL server exposes an API, and it needs to be protected. So, the platform team is going to front that GraphQL API with some kind of proxy. The proxy in this architecture is managed as cloud-native infrastructure. It's configured declaratively, it's compatible with modern Kubernetes platform patterns, and it's based on the leading proxy technology in the market, Envoy. The GraphQL server is a separate deployment, requiring code changes to modify and evolve its behavior. It also represents an extra network hop and additional operational overhead to maintain that separate service deployment. Now, there's nothing fundamentally wrong with this approach, but we feel there's a more efficient way to support GraphQL in an application architecture that's going to simplify the lives of both our app dev teams as well as our platform engineering teams.
Another question that arises from this architecture is where best to handle platform concerns like authentication and authorization. Ideally, we'd like to separate these concerns from the application itself and handle them at the gateway proxy level, just as we would for, say, an ordinary OpenAPI interface or many other kinds of API interfaces. But with the GraphQL server separated from the gateway proxy, we're forced to pass some of these concerns through and handle them outside the proxy layer. We see GraphQL users handle this in a couple of ways. The left side of the slide represents an approach where we simply take the auth-related context from the request and delegate those authentication and authorization decisions through to the backend service. The right side illustrates another approach where we take the same context and maybe instrument some identity-aware authentication code directly into the GraphQL server. Neither of these approaches is ideal. What we'd like to see is the ability to offload these concerns from imperative application code and handle them in a declarative, policy-driven fashion.
Let's take a step back and look at where we are on our Wayne Telecom journey. We've solved a number of significant problems. We've increased our front-end developers' efficiency, reduced some of our data over-fetching and bandwidth requirements, and implemented a back-end for front-end architectural pattern, all using GraphQL. Plus, we've adopted service mesh technology using Istio to lay the foundation for our zero-trust networking architecture. Have we rid Gotham City of crime? Well, maybe it's a little too early to declare victory in that battle just yet. But as we all know, because we've been doing this for a while, engineering decisions are rarely 100% positive. Every decision we make represents some kind of trade-off. Let's consider some of the downsides of the changes that we've adopted so far.
First, we've added a major new moving part in our server-side infrastructure: a dedicated GraphQL server. Our platform team has to be responsible for the care, feeding, and maintenance of that new piece of infrastructure. In addition, we've also added new responsibilities for our application teams. They need to learn GraphQL and maintain the associated schema and resolvers for their individual backend services. As we explored just a moment ago, they also need to re-implement some platform-related features like authentication and authorization to account for the fact that they are now using GraphQL as a core component of their application strategy.
When we consider these trade-offs, we want to maintain the good things that we've achieved with GraphQL and with Istio, but without adding the burden of new server types to maintain and with less of a burden on the application teams whose services participate in the mesh. We conclude, "Robin, there must be a better way."
Toward that objective of finding a better way, let's drill in a bit and take a closer look at both our proxy, our API gateway proxy that serves as our Istio Ingress Gateway, as well as our new GraphQL server. The GraphQL server, at this point, is its own separate deployment that requires code changes to modify and evolve behavior. It also involves an extra network hop and an extra moving part to maintain in our server portfolio. Again, nothing fundamentally wrong with this approach, but we feel there's a more elegant and efficient way to support GraphQL within our application architecture. We believe this approach can simplify the lives of both our application teams as well as our platform engineering teams.
So, let's ask ourselves a couple of questions here. What if there were a way to support GraphQL APIs that didn't require dedicated servers? What if we could reuse existing API gateways to serve GraphQL? In other words, just like we support OpenAPI, gRPC, and things like SOAP XML protocols today, what if we could just add GraphQL as another type of API that we support within our existing gateway architectures? And what if you could do that using best DevOps practices like declarative configuration, as opposed to having to implement those things in imperative code? And finally, what if you could leverage existing API contracts to build your GraphQL configuration?
Where these questions lead us is to a simplified architecture that consolidates GraphQL and application networking responsibilities into a single component. What we're proposing is updating our gateway Envoy proxy fleet so that it can function itself as a GraphQL server, rather than separating that function into its own component. Some of the things this will do for us include:
Eliminating the development and operational expense of managing and maintaining a separate GraphQL-focused application deployment. By removing the additional network hop to a separate GraphQL server, we improve performance and resilience because we are avoiding an extra failure mode on the data path for our request.
These GraphQL capabilities are based on declarative configuration and not on imperative code. In other words, it becomes just like the rest of your cloud-native infrastructure, fully compatible with things like CI/CD and GitOps workflows.
How is this all going to work? By leveraging existing capabilities that organizations like Solo provide in a GraphQL Envoy filter, there's no longer any need to integrate with third-party libraries to create resolvers that run inside of GraphQL-specific servers. With simple configuration changes, we can leverage an existing gateway proxy to add things like policy-driven authentication and authorization, plus services like rate limiting, response caching, and web application firewall rules, all at the edge of your application network. All of those capabilities are driven by declarative policies, not by imperative code.
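To make the filter idea concrete, here is a sketch of what an Envoy HTTP filter chain could look like when GraphQL execution is a filter stage rather than a separate server. The GraphQL filter name and its position are hypothetical placeholders; a real deployment would use the vendor-supplied filter (for example, from Solo's Envoy distribution) and its documented configuration. The other filters shown are standard, existing Envoy HTTP filters.

```yaml
# Sketch of an Envoy HTTP filter chain where auth, rate limiting, and
# GraphQL execution all happen at the gateway, driven by declarative
# configuration. The GraphQL filter name below is a HYPOTHETICAL
# placeholder, not a published Envoy filter name.
http_filters:
  - name: envoy.filters.http.jwt_authn    # authentication at the edge
  - name: envoy.filters.http.ratelimit    # policy-driven rate limiting
  - name: example.filters.http.graphql    # hypothetical: executes GraphQL
                                          # queries against upstream services
  - name: envoy.filters.http.router       # standard final routing filter
```

Because every stage is configuration rather than application code, the whole chain can be versioned and promoted through CI/CD and GitOps workflows like any other cloud-native infrastructure.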
You might have noticed earlier that I kind of waved my hands a little bit and magically transformed our existing backend services into GraphQL-aware services. Does that mean that each application team must now go and implement GraphQL awareness into each of their backend services? I recently appeared on a panel discussion with a Netflix engineer. Netflix is probably one of the most sophisticated GraphQL organizations that I've seen anywhere. They talked about that process of enabling backend services for GraphQL as one of the most difficult passages in their own GraphQL journey. We'd like to avoid that pain as much as possible.
What if we could transform these services into GraphQL without touching the underlying application code? In fact, there is a popular open-source library out there that does this very thing in JavaScript. It's called GraphQL Mesh. It leverages existing service specs to generate code that facilitates the conversion of pre-existing services into GraphQL-aware services. However, what if we could avoid even the adoption of tools like that with something like a discovery capability that could be implemented directly in our existing proxy fleet? The secret sauce would be to leverage the existing interface specifications that are already in place on the backend services, things like OpenAPI and Swagger for REST services, protobufs for gRPC services, WSDL for SOAP interfaces, and so on. By putting a sidecar that contains this discovery and translation logic next to each of these services, we're able to translate incoming GraphQL requests into requests that the application already understands in their native protocols. Effectively, we've converted the services that we want to include in the graph into GraphQL just by including them into our service mesh. That is a pretty powerful capability.
If we zoom back out and consider our new deployment architecture with GraphQL, this new model opens up new architectural possibilities, which not only enable more efficient traffic handling but also better separation of concerns within our service deployments. For example, you can see from this diagram that we have three backend-for-frontend deployments here: billing, sales, HR services. But these can now be strictly virtual services from a deployment standpoint. They all live within our gateway proxy, just on different request routing paths. The infrastructure is not only more efficient at runtime but also much cleaner from a design standpoint.
This also means we will be more resource-efficient and easier to administer as well. You might recall that with our original design, we were forced to write imperative code to handle concerns like authentication and authorization. Now, with GraphQL embedded into our proxy, the kinds of declarative Istio policy shown on this slide for things like auth, failover, circuit breaking, and rate limiting work exactly as we expect without requiring any imperative code to drive that. There's no separate GraphQL service instance within our data path anymore that's going to gum up the works. What that allows us to do is to move from an old programmatic authentication/authorization configuration like this one to something more like this, where we can have an API that simply describes how we want our APIs to operate with respect to auth. Pretty powerful stuff.
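For instance, assuming clients present JWTs, gateway-level authentication and authorization can be expressed entirely as standard Istio policy like the following. The issuer and JWKS URL are illustrative values, not endpoints from the talk.

```yaml
# Declarative JWT validation and authorization at the ingress gateway,
# replacing imperative auth code in a dedicated GraphQL server.
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-on-ingress
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    - issuer: "https://auth.example.com"            # illustrative issuer
      jwksUri: "https://auth.example.com/.well-known/jwks.json"  # illustrative
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: ALLOW
  rules:
    - from:
        - source:
            requestPrincipals: ["*"]   # only requests carrying a valid JWT
```

Requests without a valid token are rejected at the edge, so no GraphQL-aware component ever has to re-implement that logic in code.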
What does this declarative config buy us? First, it's going to allow us to replace programmatic GraphQL schema and resolvers with declarative configuration. Second, we can use GraphQL custom directives, such as the "resolve" directive shown on this slide, to link resolver configuration with particular fields. Finally, the configuration can be discovered by the control plane from existing server interfaces, essentially writing the GraphQL server for you. We can discover and create these GraphQL API schema objects on a per-service basis, leveraging the interface contracts that those services are already publishing. But we don't have to use that discovery capability if we don't want to. We can also build and maintain our own schema if we prefer that approach. We can also stitch these schemas together into a unified supergraph. That allows clients to think just about the data they need from the graph and not necessarily about what services they need to invoke in order to retrieve that data. The embedded GraphQL proxy filter handles all those details around dispatching and data retrieval for them. That's a powerful thing.
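A field-level "resolve" directive of the kind described might look like the following sketch. The directive declaration, its argument, and the resolver name are illustrative; the exact syntax is defined by the gateway implementation.

```graphql
# Sketch of a custom directive that binds a schema field to a named
# resolver defined elsewhere in declarative gateway configuration.
# The directive and resolver names here are HYPOTHETICAL examples.
directive @resolve(name: String!) on FIELD_DEFINITION

type Query {
  blogs: [Blog] @resolve(name: "blog-rest-resolver")
}

type Blog {
  id: ID!
  username: String!
  title: String!
  content: String!
}
```

The schema stays a pure data contract, while the directive tells the proxy which declaratively configured resolver to invoke when a client asks for that field.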
Let's see, we have just a few minutes left here. Let me quickly go to a demonstration to show you how this works in practice.
All right, so I'm going to switch over here. This is an environment called Instruqt where we've provisioned a Kubernetes cluster out here in our environment. We've also installed some basic services. We have a couple of demo applications here. We've got something that represents blogs, comments within the blogs, and then users who have either written the blogs or provided comments. We have these little microservices that represent that. We can confirm we've installed virtual services at the gateway layer. So, for example, we can do things like this, and you can see we can actually curl that blog endpoint. We get back these JSON representations of our blogs that are stored in our database. You can see things like the username, the title for the blog, the content that's associated with it, and so on. This is just an ordinary REST service that exists within our Kubernetes cluster that we have presented to the outside world using a standard virtual service exposed via an Envoy proxy.
Now, let's see what the process would look like for us to make that available via GraphQL. At this point, we've done nothing to expose any GraphQL from the applications themselves or to the outside world. For example, there is a GraphQL API object type, which is just a Kubernetes custom resource in our environment here. If we take a look at that, you can see that we don't have any of those defined yet. All we have is just a simple collection of REST services.
The first thing we want to do is activate a discovery feature. It's going to look at these services that we have deployed, discover the interfaces (in this example, we've got an OpenAPI interface and a gRPC interface), and then generate an initial take on what a GraphQL API would look like. We can modify that if we want to. We're not tied to this discovery mechanism, but it gives us a good place to start. We've labeled this blog service for discovery. Now, if we go back and take a look at the GraphQL API objects, we have one that has been discovered based on this service. We can take a look at that in YAML form. It's also easier to understand if we drill into this using the admin interface. We have a GraphQL query that's been discovered on that interface, a mutation that allows us to update new blog content, and the actual data elements themselves. All of this has been discovered for us.
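As a rough sketch, a discovered GraphQL API custom resource could look something like the following. The API group, field layout, and resolver shape are modeled loosely on Gloo Edge's GraphQLApi CRD and should be treated as illustrative rather than exact; the names are taken from the blog demo.

```yaml
# Illustrative sketch of a discovered GraphQLApi custom resource.
# Field names loosely follow the Gloo Edge GraphQLApi CRD; consult the
# actual CRD reference for the authoritative schema.
apiVersion: graphql.gloo.solo.io/v1beta1
kind: GraphQLApi
metadata:
  name: blog-graphql        # illustrative name
  namespace: default
spec:
  executableSchema:
    schemaDefinition: |
      type Query {
        blogs: [Blog]
      }
      type Blog {
        id: ID!
        username: String!
        title: String!
        content: String!
      }
    executor:
      local:
        resolutions:
          # Binds the "blogs" field to the service's native REST endpoint,
          # generated from the service's discovered OpenAPI spec.
          blogs-resolver:
            restResolver:
              request:
                headers:
                  ":method": GET
                  ":path": /api/blogs   # illustrative native REST path
              upstreamRef:
                name: blog-svc          # illustrative upstream name
                namespace: default
```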
Now, we can use these query interfaces to get at what's going on inside this service that knows nothing about GraphQL. But we can use a GraphQL interface to explore that information. So how do we do that? First, we need to set up some routing to this service. We had this endpoint, which is just an endpoint to the native REST endpoint that this service supports. We've now added another route to this service that says there's also a GraphQL schema associated with this endpoint. We're going to route this "/graphql" request path to that GraphQL API. When we do that, we can access it using a curl endpoint, but there are also tools within the environment that allow us to access it there as well.
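The routing change described here can be sketched as a virtual service with two routes: the service's native REST path and a new `/graphql` path that references the GraphQL API schema. The field names follow the general Gloo Edge VirtualService style and the resource names are illustrative.

```yaml
# Sketch of gateway routing that serves the discovered GraphQL API
# alongside the service's native REST route. Resource and path names
# are illustrative; field layout loosely follows Gloo Edge conventions.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: blog-routes
  namespace: default
spec:
  virtualHost:
    domains: ["*"]
    routes:
      - matchers:
          - prefix: /graphql
        graphqlApiRef:           # handled by the proxy's embedded GraphQL filter
          name: blog-graphql     # illustrative GraphQLApi name
          namespace: default
      - matchers:
          - prefix: /api/blogs   # illustrative native REST path
        routeAction:
          single:
            upstream:
              name: blog-svc     # illustrative upstream name
              namespace: default
```

With both routes in place, the same backend answers plain REST calls and GraphQL queries, with no GraphQL code in the service itself.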
If we go into this Explorer example, I need to get access to the endpoint because this is running in an Instruqt environment. I need to get the endpoint that's exposed from that environment. Let's put that endpoint here, and then we'll also provide a GraphQL query right here. Let's paste that in. You can see we're getting blog content, the ID, the username, and so forth. We run that query, and what's going on here? Oh, I didn't apply the final change.
Thank you. There we go. Now you can see there's the blog content, just like we saw in the initial REST request, only now it's being served up via GraphQL. We could go in here and change that. Let's say we wanted to get rid of this content and replace it with the title of the blog. We could run that, and we can interactively play with these, again, backend services that know nothing about GraphQL and inspect the various elements here.
There's more I'd like to do, but we are out of time. If you're interested in more details on this, I invite you to go to our free learning site, academy.solo.io. There's more information about this and how it works. There's also a Slack environment, slack.solo.io, where you can learn more. I want to thank you for your time on this late afternoon. Thank you.