Migrating Netflix to GraphQL: A Seamless Transition

Introduction

In this post, we look at how Netflix migrated its mobile apps to GraphQL, completely overhauling the client-to-API layer. The migration was achieved with zero downtime and relied on careful planning, several layers of testing, and a phased rollout. Let’s dive into the details and learn from Netflix’s experience.

Phase 1: GraphQL Shim Service

To kick-start the migration, Netflix created a GraphQL Shim Service on top of the existing monolithic Falcor API. This let client engineers adopt GraphQL quickly without being blocked by server-side migrations. They could explore client-side GraphQL concerns such as cache normalization, experiment with different GraphQL clients, and investigate client performance. The shim acted as a bridge: clients spoke GraphQL, while the data still came from the existing Falcor system behind it.
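
To make the bridging idea concrete, here is a minimal sketch of a shim-style resolver that turns a GraphQL field selection into Falcor-style path lookups. Everything in it is an assumption made for illustration: the in-memory `FALCOR_STORE`, the `falcor_get` helper, and the `resolve_video` function are hypothetical stand-ins, not Netflix's actual shim code.

```python
from typing import Any, Dict, List, Tuple

Path = Tuple[Any, ...]

# Hypothetical stand-in for the legacy Falcor API: a path tuple maps to a value.
FALCOR_STORE: Dict[Path, Any] = {
    ("videos", 1, "title"): "Example Show",
    ("videos", 1, "runtime"): 3000,
}


def falcor_get(paths: List[Path]) -> Dict[Path, Any]:
    """Look up each requested path, mimicking a Falcor-style path query."""
    return {path: FALCOR_STORE.get(path) for path in paths}


def resolve_video(video_id: int, fields: List[str]) -> Dict[str, Any]:
    """Shim resolver: translate selected GraphQL fields into Falcor paths."""
    paths = [("videos", video_id, field) for field in fields]
    result = falcor_get(paths)
    # Reshape the path-keyed result into the object shape a GraphQL client expects.
    return {field: result[("videos", video_id, field)] for field in fields}


if __name__ == "__main__":
    print(resolve_video(1, ["title", "runtime"]))
```

The point of the sketch is that the client only sees the GraphQL-shaped result, so the backing store can later be swapped out without changing client queries.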

Phase 2: Deprecating Legacy API Monolith

Building on the success of the GraphQL Shim Service, Netflix moved towards deprecating the legacy API monolith in favor of GraphQL services owned by domain teams. This transition relied on Federated GraphQL, a distributed approach in which specific sections of the API are owned and managed independently by domain teams. Splitting the graph this way removes the monolith as a single point of change and gives each team ownership of its own domain.
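
The ownership model can be sketched as a gateway that routes each top-level field to the service that owns it. This is a deliberately simplified illustration: a real federated gateway composes schemas and plans queries across services, which this sketch does not attempt. The service names, the `OWNERS` table, and the `execute` function are all hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical ownership table: which domain service owns which top-level field.
OWNERS: Dict[str, str] = {
    "video": "video-service",
    "profile": "profile-service",
}

# Hypothetical domain services, each resolving only the fields it owns.
SERVICES: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "video-service": lambda args: {"id": args["id"], "title": "Example Show"},
    "profile-service": lambda args: {"id": args["id"], "name": "Example Profile"},
}


def execute(query: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Route each top-level field (field name -> arguments) to its owning service."""
    response: Dict[str, Any] = {}
    for field, args in query.items():
        owner = OWNERS.get(field)
        if owner is None:
            raise KeyError(f"no service owns field {field!r}")
        response[field] = SERVICES[owner](args)
    return response


if __name__ == "__main__":
    print(execute({"video": {"id": 1}, "profile": {"id": 7}}))
```

In this arrangement a domain team can change how its own fields are resolved without touching any other team's service.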

Testing Strategies

During the migration process, Netflix employed various testing strategies to ensure a safe and successful transition to GraphQL. Let’s explore some of these strategies in more detail:

AB Testing

AB Testing, a widely used technique at Netflix, played a crucial role in evaluating the impact of migrating from Falcor to GraphQL. By comparing metrics such as error rates, latencies, and time to render between the legacy Falcor stack and the GraphQL Shim Service, Netflix could assess the customer impact of the new system. This showed how the GraphQL path compared with the legacy path on both performance and correctness.
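
As a rough illustration of this kind of comparison, the sketch below summarizes request samples from two experiment cells and reports the difference in error rate and mean latency. The `RequestSample` type and the function names are hypothetical, and a real experiment would also apply statistical significance testing, which is omitted here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestSample:
    latency_ms: float
    is_error: bool


def summarize(samples: List[RequestSample]) -> Dict[str, float]:
    """Compute the error rate and mean latency for one experiment cell."""
    return {
        "error_rate": sum(s.is_error for s in samples) / len(samples),
        "mean_latency_ms": mean(s.latency_ms for s in samples),
    }


def compare(control: List[RequestSample], treatment: List[RequestSample]) -> Dict[str, float]:
    """Return treatment minus control for each metric (negative means lower in treatment)."""
    c, t = summarize(control), summarize(treatment)
    return {metric: t[metric] - c[metric] for metric in c}


if __name__ == "__main__":
    falcor = [RequestSample(100.0, False), RequestSample(200.0, True)]
    graphql = [RequestSample(100.0, False), RequestSample(100.0, False)]
    print(compare(falcor, graphql))
```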

Replay Testing

To ensure the functional correctness of the migration, Netflix developed a dedicated tool called Replay Testing. The tool was designed to verify the migration of idempotent APIs from the GraphQL Shim Service to the Video API service. Because an idempotent request can be repeated without changing the outcome, the same request can safely be sent to both implementations. By capturing and comparing the response payloads, Netflix could identify any differences and confirm that functionality remained intact, catching discrepancies before they reached customers.
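
The comparison step can be sketched as a recursive diff over two JSON-like payloads. This is only an illustration of the idea, assuming string keys and treating any value mismatch as a difference; the `diff_payloads` function is hypothetical and is not the Replay Testing tool itself.

```python
from typing import Any, List


def diff_payloads(control: Any, candidate: Any, path: str = "") -> List[str]:
    """Return human-readable differences between two JSON-like values."""
    if isinstance(control, dict) and isinstance(candidate, dict):
        diffs: List[str] = []
        for key in sorted(set(control) | set(candidate)):
            child = f"{path}.{key}" if path else key
            if key not in control:
                diffs.append(f"{child}: only in candidate")
            elif key not in candidate:
                diffs.append(f"{child}: only in control")
            else:
                diffs.extend(diff_payloads(control[key], candidate[key], child))
        return diffs
    if isinstance(control, list) and isinstance(candidate, list):
        if len(control) != len(candidate):
            return [f"{path}: list length {len(control)} != {len(candidate)}"]
        diffs = []
        for index, (a, b) in enumerate(zip(control, candidate)):
            diffs.extend(diff_payloads(a, b, f"{path}[{index}]"))
        return diffs
    # Leaf values (or mismatched container types): compare directly.
    return [] if control == candidate else [f"{path}: {control!r} != {candidate!r}"]


if __name__ == "__main__":
    shim_response = {"video": {"id": 1, "tags": ["a", "b"]}}
    new_response = {"video": {"id": 1, "tags": ["a", "c"]}}
    for line in diff_payloads(shim_response, new_response):
        print(line)
```

An empty result means the two payloads matched; any entries point at the exact field that diverged.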

Sticky Canaries

While Replay Testing focused on functional correctness, Netflix also wanted to verify how the migration affected performance and business metrics. To do this, they turned to a Netflix tool called Sticky Canary. By assigning customers to either canary or baseline hosts for the duration of an experiment, Netflix could gather insights into latency, resource utilization, and overall quality of experience (QoE) metrics. This enabled them to fine-tune and optimize the new GraphQL services based on real-world performance data.
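
One common way to make an assignment "sticky" is to derive it from a hash of the customer and experiment identifiers, so that the same customer always lands in the same group. The sketch below assumes that approach; the source does not describe how Sticky Canary assigns customers, and the `assign_group` function, its parameters, and the group names are hypothetical.

```python
import hashlib


def assign_group(customer_id: str, experiment_id: str, group_basis_points: int = 100) -> str:
    """Deterministically map a customer to 'canary', 'baseline', or 'production'.

    group_basis_points is the size of EACH of the canary and baseline groups,
    in hundredths of a percent (100 means 1% canary and 1% baseline).
    """
    key = f"{experiment_id}:{customer_id}".encode("utf-8")
    # Hash to a stable bucket in [0, 10000); the same inputs always give the same bucket.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 10_000
    if bucket < group_basis_points:
        return "canary"
    if bucket < 2 * group_basis_points:
        return "baseline"
    return "production"


if __name__ == "__main__":
    for customer in ("customer-1", "customer-2", "customer-3"):
        print(customer, assign_group(customer, "graphql-migration", group_basis_points=2500))
```

Including the experiment identifier in the hash means a customer's group in one experiment does not determine their group in the next.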

Conclusion

Netflix’s migration to GraphQL was a resounding success, thanks to careful planning, strategic testing, and a step-by-step approach. By creating a GraphQL Shim Service and gradually deprecating the legacy API monolith in favor of domain-owned GraphQL services, Netflix was able to seamlessly transition to a more scalable and flexible system. Through AB Testing, Replay Testing, and Sticky Canaries, they ensured both functional correctness and improved performance metrics.

Migrating to GraphQL is not a small undertaking, but with the right strategies in place, it can be achieved without disrupting the user experience. Netflix’s experience serves as a valuable lesson for any organization considering a migration to GraphQL.

Tags: Netflix, GraphQL, Migration, API