
The Future of GraphQL Federation

Curtis Layne

Apollo introduced GraphQL Federation v1.0 in May 2019. Since then, there’s been one major iteration: v2.0, released in April 2022.

Apollo Federation solves a real need, but it still has significant shortcomings, even six years after it was introduced.

In this post, we’ll explore the good, the bad, and where the ecosystem is going to build a better future for GraphQL Federation.

🤔 Why Federation?

Client engineers love using GraphQL. Being able to see all of your data as a single, queryable entity, and not having to rebuild entity joining logic on every client (web + Android + iOS + tvOS + etc), is incredibly powerful, and enables teams to move much more quickly.

But GraphQL is just a schema and query specification — it offers zero guidance on managing complexity as your schema grows from dozens to thousands of types. As schemas and GraphQL deployments grow, companies start running into the same set of organizational problems that lead them to adopt microservices (I won’t debate whether microservices are a good pattern here, you can ask ChatGPT about that holy war).

Apollo had a great insight: with Federation, rather than building a giant, monolithic GraphQL server, companies can split the GraphQL schema into “subgraphs”. This enables us to get the best of all worlds:

  • Subgraphs implemented as microservices enable decoupled teams and deployments.
  • Client engineers still interact with the GraphQL schema as a single, unified artifact, blissfully unaware of the complex implementation under the hood.

✅ Good parts

Apollo Federation gets a lot right:

🔑 Declarative Relationships with @key

The core of the Apollo Federation spec revolves around the concept of foreign key constraints (@key in the Federation specification) and a query planner — not unlike a database engine.

Relationships between subgraph microservices can be declared, with the Federation Router (the equivalent of the database engine) planning queries, executing them, and assembling the responses. This works very well for entity definitions, such as type User, that are split across multiple microservices.

🖥️ Computed properties

There are other cool ideas in the spec as well, like @requires, which enables subgraphs to contribute derived fields to existing entities. As we’ll see later though, this API feels incomplete in some key ways.

🚨 Problems

When you dig in more deeply, it becomes clear that there are serious design and scaling concerns.

🌊 Apollo Federation leaks into the subgraphs

The first and largest issue is that the Apollo Federation spec leaks implementation details about how the query planner works out of the Router and into the subgraphs. Let’s take a simple example along these lines:

```graphql
# user-subgraph
type Query {
  user(id: ID!): User
}

type User @key(fields: "id") {
  id: ID!
  name: String!
}

# review-subgraph
type User @key(fields: "id") {
  id: ID!
  reviews: [Review!]!
}

type Review {
  body: String!
}
```

When the Router performs composition on these schemas and then receives a query like the following from a client:

```graphql
query {
  user(id: "1") {
    name
    reviews {
      body
    }
  }
}
```

It expands conceptually into the following query plan:

  1. Get user by ID from user-subgraph
  2. Do a User entity lookup against review-subgraph by @key, using the user ID from #1. Include the reviews.body field.
  3. Assemble the results into a single response.
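Concretely, a plan like this means the Router issues two requests along these lines (sketched here; the exact shape varies by implementation), the second of which goes through the Federation-specific _entities entry point:

```graphql
# Step 1: sent to user-subgraph
query {
  user(id: "1") {
    id
    name
  }
}

# Step 2: sent to review-subgraph, an entity lookup by @key
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on User {
      reviews {
        body
      }
    }
  }
}
```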

We’ve effectively invented the concept of “entity lookups” by key. The Router is aware of this concept as it has to plan, execute, and then assemble the response using these rules.

The GraphQL subgraphs, however, also need to understand the way in which @key maps into entity lookups. Query planning and execution logic leaks into the subgraph implementation itself — even though the Router has already done all of the work to compose and validate the supergraph and plan queries.

This leaking abstraction is manageable for @key, but when you move on to more powerful primitives like @requires or @provides, it turns out to be really hard to do.

🧩 75+ GraphQL server implementations

This complex work gets pushed down into every single GraphQL server: each must implement custom, non-spec code to be compatible with Federation, and there are 75+ different GraphQL server implementations!
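To get a sense of what that custom code entails, here is an abbreviated sketch of the Federation-specific machinery a subgraph server has to expose on top of its own schema (the full subgraph spec includes more):

```graphql
# Federation-specific additions every subgraph server must implement
scalar _Any

union _Entity = User

type _Service {
  sdl: String
}

extend type Query {
  # Entity lookup endpoint generated from @key directives
  _entities(representations: [_Any!]!): [_Entity]!
  _service: _Service!
}
```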

Not only does initial support require these changes, but any future Federation primitive that gets added requires yet more custom code in each server. This isn’t scalable if we want to add new capabilities and Federation primitives (like improvements to @requires). Changing 75+ servers every time you have a new idea is a huge barrier to adoption.

🐢 Performance bottlenecks

Another issue is subgraph performance. Queries that come into the Router must then be broken down into smaller queries and sent to the subgraph servers. In Apollo Federation, this happens over HTTP+JSON using standard GraphQL requests. Without a complex pipeline that includes persisted/allowlisted queries at the subgraph level, subgraph servers must parse the incoming GraphQL request strings into an AST, execute them, and send the results back out over HTTP. For companies with high load on their servers, this is a very meaningful performance hit that you wouldn’t have with something like Proto+gRPC or Thrift.

👩🏻‍🔬 Subgraphs are overengineered

GraphQL is inherently fairly complex to implement in the server. It requires an advanced orchestration engine (the GraphQL server framework), and an engineer who understands nuances like N+1 queries and how to avoid them. This complexity is necessary when you’re implementing a monolithic graph: there can be arbitrary, cyclical relationships in the graph, and the only mechanism to fetch just the necessary fields is a GraphQL server implementation that parses the query AST, walks it, and recursively calls resolver functions.
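To make the N+1 hazard concrete, here is a minimal, self-contained Python sketch (a hypothetical in-memory store, not a real GraphQL framework) contrasting a naive per-entity resolver with a DataLoader-style batched one:

```python
class ReviewStore:
    """Stand-in for a reviews database that counts round trips."""

    def __init__(self, reviews_by_user):
        self.reviews_by_user = reviews_by_user
        self.query_count = 0

    def fetch_one(self, user_id):
        self.query_count += 1  # one round trip per key
        return self.reviews_by_user.get(user_id, [])

    def fetch_many(self, user_ids):
        self.query_count += 1  # one round trip for the whole batch
        return {uid: self.reviews_by_user.get(uid, []) for uid in user_ids}


def resolve_reviews_naive(user_ids, store):
    # The N+1 pattern: one store query per user in the result set.
    return {uid: store.fetch_one(uid) for uid in user_ids}


def resolve_reviews_batched(user_ids, store):
    # DataLoader-style batching: collect keys, issue a single query.
    return store.fetch_many(user_ids)
```

With 100 users, the naive resolver issues 100 store queries while the batched one issues a single query; real GraphQL servers hide this behind a per-request loader.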

With Federation, however, the subgraphs don’t actually need to do this at all. We can have the Router do all of the complex work of orchestrating and calling the correct “resolvers” (subgraphs), and leave the subgraphs to be dumb, “normal”, API servers.

😕 @requires is a headache

@requires on its face seems like a very powerful API. You can take existing fields in the graph, declaratively request them, and contribute back new fields on the same entity that are derived from these input fields.

This is a powerful concept that enables us to split end to end field ownership along team lines. It allows for a truly decoupled data composition pattern that’s incredibly powerful and can enable feature teams to very quickly iterate and build on a core schema.

With @requires, your GraphQL subgraph server effectively becomes both a GraphQL client and a server. There are a few issues with the spec and implementation:

  1. You can’t @requires fields from your own subgraph. This significantly limits its flexibility.
  2. There’s no way to receive errors from the upstream fields: how do we handle a case where I want to omit the field derived via @requires when there was an upstream error? Or where I still want to return it, but modify the result if one of the input fields errored out?
  3. Implementing this in a type-safe way in GraphQL server frameworks is extremely complex.

💡 What can we do instead?

If we think from first principles, what are the ideal characteristics of a Federation GraphQL system that we would want to have?

Ideal Federation specification properties

  1. Clients see only a unified supergraph schema: the complexities and implementation details of the backend are completely hidden away from them.
  2. Push as much complexity as possible into the Router / composition layer, leaving subgraph servers as simple as possible.
  3. Subgraph servers can be implemented in any programming language with limited to no additional effort when adding new Federation capabilities.
  4. High performance subgraph servers out of the box, automatically implementing entity batching, and using pre-compiled interfaces to improve performance and reduce network serialization CPU costs and wire latency.

Option 1 — Composite Schema specification

A group of people are currently working on a new version of the subgraph Federation specification called the Composite Schema specification. They’ve recognized many of the problems I write about here, and are working to rectify them.

For example, instead of an @key directive that implicitly generates an _Entities type and an entity lookup endpoint, you can simply attach an @lookup directive to an existing query resolver, and the supergraph Composer and Router will parse it and build out entity relationships:

```graphql
type Query {
  userById(id: ID!): User @lookup
}

type User {
  id: ID!
  name: String!
  reviews: [Review!]!
}
```

By simply using a GraphQL spec compliant directive, we no longer have a Federation-aware subgraph server implementation. This solves for requirement #3, and mostly solves for #2.

A similar solution is proposed for @requires (renamed to @require): you annotate a field argument with the @require directive, and the Router will simply populate it before calling the subgraph:

```graphql
type Product {
  id: ID!
  dimension: ProductDimension
  deliveryStatus(
    size: Int! @require(field: "dimension.size")
  ): DeliveryStatus
}
```

On its face, it seems like this is trending towards the requirements we set out. It falls down in a couple of pretty key ways, however:

  1. How can you do batching in this world? You’ve traded an implicitly batched entity lookup from Apollo Federation (type Query { _entities(representations: [_Any!]!): [_Entity]! }) for an API that cannot be batched at all. Field arguments in GraphQL cannot be specified on a per-element basis, because GraphQL was not intended to be used for orchestration in this way: it’s a client-side query language intended for when the client doesn’t know how many entities it will get back. Because @require uses field arguments, @lookup fields cannot be combined with @require if we want to retain batching, which is absolutely critical for performance and for solving the N+1 problem.

  2. There’s no mechanism to pass upstream errors into the @require field. If I want the result of the Product.deliveryStatus resolver to be different when dimension.size is semantically null vs null because of an error, how can I do this? My resolver function needs to operate on the error as well as the input value, but there’s no place within the GraphQL language that enables me to pass in anything for size other than an Int.
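To make point #1 concrete, sketched with a hypothetical productById lookup field: because the argument values differ per entity, there is no single batched entity lookup; at best the Router can alias N separate field invocations, each resolved independently:

```graphql
# One resolver call per entity; the @require-populated argument
# differs for each product, so nothing collapses into one batch.
query {
  p1: productById(id: "1") {
    deliveryStatus(size: 10)
  }
  p2: productById(id: "2") {
    deliveryStatus(size: 4)
  }
}
```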

This approach looks like it will be able to solve for requirements #1 and #2. It partially solves for #3 (though it seems to be boxed out of fully solving it), and actually takes a step backwards on #4 (performance).

Option 2 — Proto+gRPC subgraphs

What if, instead of using GraphQL in subgraphs at all, we generate Proto APIs that subgraphs implement? This is the approach that WunderGraph is currently taking with Cosmo. It has several advantages:

  1. Automatically generate batch entity lookup endpoints, solving the N+1 problem.
  2. Proto already has native, high performance implementations in every major programming language.
  3. New Federation primitives can be generated as new endpoint contracts, enabling the community to play with new ideas without being limited to the GraphQL language spec or existing contracts.

For example, a version of @requires with this model could simply be a new endpoint:

```graphql
type Product @key(fields: "id") {
  id: ID!
  dimension: ProductDimension
  deliveryStatus: DeliveryStatus
    @requires(fields: "dimension { size weight }")
}

type ProductDimension {
  size: Int!
  weight: Int!
}

type DeliveryStatus {
  estimate: String!
}
```

This could transpile into the following Proto:

```proto
syntax = "proto3";

package product.v1;

service ProductService {
  // Generated from the @requires directive on Product.deliveryStatus.
  // Takes a batch of inputs so the Router can solve N+1 in one call.
  rpc ResolveProductDeliveryStatus(ResolveProductDeliveryStatusRequest)
      returns (ResolveProductDeliveryStatusResponse) {}
}

message ProductKey {
  string id = 1;
}

message ProductDimension {
  int32 size = 1;
  int32 weight = 2;
}

message ResolveProductDeliveryStatusInput {
  ProductKey key = 1;
  // The required fields, resolved and populated by the Router.
  ProductDimension dimension = 2;
}

message ResolveProductDeliveryStatusRequest {
  repeated ResolveProductDeliveryStatusInput inputs = 1;
}

message DeliveryStatus {
  string estimate = 1;
}

message ResolveProductDeliveryStatusResponse {
  repeated DeliveryStatus results = 1;
}
```

Here, when we request the @requires field, we’re trading an additional gRPC network request inside our data center between the Router and the subgraph (~10ms) for pushing all of the complexity of @requires into the Router, along with built-in batching and high-performance networking out of the box.

This approach looks to me from first principles like it will eventually be able to fully solve for requirements #1–4. It’s not fully there yet, but I don’t see any obvious architectural limitations.

🏁 Final thoughts

GraphQL Federation is an incredibly powerful concept, but we as a community still haven’t nailed the right model for it to really scale inside large enterprises. There hasn’t been a ton of progress in the last few years, I believe primarily because of the issues I outlined here.

I see a future where more orchestration is done declaratively in the GraphQL schema, enabling smarter query planning and caching, and unlocking the true power of density in the graph. This goes beyond declaring foreign key relationships. That’s a great start, but it isn’t enough for teams with 1000s of engineers trying to build complex products.

The idea of replacing GraphQL subgraphs with simpler Proto servers moves us in this direction. Having the flexibility to invent new Federation primitives / directives and simply generate net new endpoint contracts from them without worrying about how they fit into the existing GraphQL schema language opens the door to much quicker iteration.

I think we’re just beginning to see more enterprises adopt GraphQL as we move into a new evolution of GraphQL Federation based on smarter Routers and simpler, more performant subgraphs.


Editor’s note: Curtis’s post closely reflects how we’ve been thinking about the future of Federation at WunderGraph. For a deeper look at the gRPC-based approach we’re building toward, read The Future of Federation: Replacing GraphQL Subgraphs with gRPC Services by our CEO, Jens Neuse.


Originally published on Medium by Curtis Layne.

Reposted with permission.