Migrating Stitch Fix’s Core Recommenders to One Config-Driven Platform
Summary
I’m migrating our three payment processor integrations into one config-driven library. The integrations share similar implementations and business logic. We plan on adding at least three more integrations, which means six Node packages plus two more for shared tests and shared code/interfaces. Managing eight packages will be challenging for our small team.
Each processor expects different requests, returns different responses, and supports different features. Despite this, I believe these packages are great candidates for config-driven development (CDD) because the implementations are nearly identical, the dependencies are the same (they all use the same request, logging, and processing abstractions), and they all expose the same interface. The goal is one integration, one configuration, and multiple use cases.
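To make that goal concrete, here is a minimal sketch of the shape I have in mind, assuming hypothetical names (ProcessorConfig, buildChargeRequest, baseUrl) rather than our actual code: each processor gets a config describing its endpoints, supported features, and request/response mappings, and one shared integration consumes whichever config it is given.

```typescript
// Minimal sketch: ProcessorConfig, buildChargeRequest, etc. are hypothetical names,
// not our actual library's API.
interface ProcessorConfig {
  name: string;                                    // which processor this config drives
  baseUrl: string;                                 // where its requests go
  supports: { refunds: boolean; partialCapture: boolean };               // per-processor feature flags
  buildChargeRequest(amountCents: number, token: string): unknown;       // processor-specific request shape
  parseChargeResponse(raw: unknown): { ok: boolean; chargeId?: string }; // normalize the processor-specific response
}

// One integration, many configs: the same charge() serves every processor.
async function charge(config: ProcessorConfig, amountCents: number, token: string) {
  const res = await fetch(`${config.baseUrl}/charges`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(config.buildChargeRequest(amountCents, token)),
  });
  return config.parseChargeResponse(await res.json());
}
```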
While thinking about this work, I reflected on a similar migration I did at Stitch Fix. Below is what I learned from that migration. I hope this helps you (and future me).
Highlights
- I migrated three business logic-heavy recommendation engines into one config-driven recommendation engine that can be deployed separately to serve each use case.
- Through this migration, I made the recommendation engines transparent and replayable — for the first time in Stitch Fix’s ten-year life, PMs, buyers, data scientists, etc., could figure out why items were recommended without asking engineering.
- Reduced learning curve; all the apps shared the same structure and code paths, so new engineers only needed to learn the business rules.
- Reduced on-call burden; now every issue could be debugged by replaying the request that led to it, and since all the business logic was centralized in the config, it was obvious where to start debugging. Plus, there was only one code base to keep track of.
- Centralized business logic: one source of truth for our business partners’ questions and one place to change to set up experiments.
Learnings
Start small and simple
- The goal is not to eliminate all redundancies — CDD pays off as long as it removes more code than it adds
- Removing all redundancies may be impossible or too expensive. Don’t let the perfect be the enemy of the good
- CDD moves complexity; it does not remove it — complexity moves from the code to the configuration, so focus on making the configuration as simple and small as possible
Avoid these time sinks
- Overthinking/over-planning — you can learn a lot, and learn it fast, from a quick prototype
  - Do you have a question your team is debating constantly? Create a test, run it, and settle the debate
- Early generalization/abstraction — don’t delay feedback (completion and launch) for maybes and ideals; get something working and test it, then generalize
- Over-engineering — building features or abstractions you will not use immediately is another form of feedback delay; avoid it
- Bad assumptions/missing info — see the next point
Plan your migrations to optimize for learning, not speed
- Eliminate as many unknowns as possible in the initial migrations; what you don’t know is what will blow up your timeline
- Focusing on filling those knowledge gaps will naturally lead to faster migrations, as opposed to planning for speed initially and then finding out you’re missing a crucial piece of information or your assumptions are wrong
Performance should always be top of mind
- Migrations can often devolve into performance optimization hell if you are not keeping track of performance every step of the way
The configuration is NOT an interface; don’t build it like one
- This will lead you to a configuration that is overly abstract, i.e., hard to understand and easy to break
- It is okay if not all fields in the configurations are shared — it is better to have lots of specific configuration fields that don’t apply to all deployments than something that is overly abstract and can’t be understood without looking at the code it executes
- Think of the configuration as a manual for operating a machine — the machine has one form but different functions — a good manual lets you operate these functions efficiently without needing to buy a machine for each task
You can read about the entire migration in detail below.
Quals
Stitch Fix (SF) is in the recommendations business. Customers give SF their personal information and preferences — location, size, height, favorite colors, etc. SF then uses that data and inventory to create an initial list of clothes that the customer can wear — think hundreds to tens of thousands of clothing items (SKUs) depending on the customer’s preferences and the available inventory. These SKUs then go through various models and show up in front of stylists and clients. My job for a year at SF was to maintain the 2nd step — the initial recommender that created the primary list of SKUs. This is Quals. Every article of clothing that is displayed on SF’s website, apps, and stylist portal was first filtered by Quals. Every item that was bought was picked by Quals.
Quals consisted of 12 apps — six business lines with two versions each, v1 (batch — DB queries) and v2 (realtime — API requests), split across two regions. Production was pointing to v1 for some apps and v2 for others — they were in the middle of migrating to real-time features, which I helped with, but it was a slow multi-year and multi-department process. Quals was prone to random outages and broken business rules. Only one engineer had enough insight to diagnose and fix these problems; of course, he was busy doing other projects. Quals was not healthy.
However, my manager was determined to fix and stabilize this environment and had his eyes set on Quals. During my first week there, he set the agenda: stop the outages and reduce the maintenance overhead. He was not afraid to delay features and take risks to solve systemic issues and unlock future growth. He saw what an albatross Quals had become for the platform team and the wider Data Science org. It was clear Quals required a redesign. My manager gave me room to explore different redesign approaches.
Initially, we decided migrating all apps to real-time features (v2) would reduce overhead and give us room to investigate the outages. However, we quickly realized that not all features could be received via API calls, and the teams that owned these features did not have enough capacity to work on these changes. While debating how to migrate these features and if it was a good idea for us to do so now, I came up with another approach to solve our problems.
I had finished onboarding and learning from everyone who knew anything about Quals. I realized, like my teammates did, that all the Quals apps did the same thing. Get the available inventory, get the client’s features, get the SKUs’ features, and apply business rules on those feature sets to filter the SKUs and return the result. My unique perspective was that I saw all these parts (inventory, client features, SKU features, and business rules) as inputs.
The source of the inventory or feature data did not matter; the business rules also did not matter. Quals needed to know where to pull the data from, which business rules to apply, and a pipeline to orchestrate and execute these tasks. Everything could be loaded at runtime once these inputs were available. This was the inception of the configuration (config).
The config was a JSON file listing the feature and SKU tables or APIs to pull from and the business rules to execute. This config is consumed by a transformer that loads and transforms the data, creates the pipeline, and executes it. I pitched my configuration idea to the team and got a green light. A week later, I had a working version. In less than a year, I had migrated all the US apps into one configuration-driven platform while doing my primary job: maintenance and support (on-call rotations, launching experiments, updating business rules, fielding questions and feature requests from data scientists & ML engineers, etc.).
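For a rough idea of the shape, here is an illustrative sketch of such a config; the field names, tables, and rule names are stand-ins, not the real Quals schema.

```typescript
// Illustrative only: field names, tables, and rule names are hypothetical.
const mensConfig = {
  businessLine: "mens",
  clientFeatures: { source: "batch", table: "client_features_mens" }, // where client features come from
  skuFeatures: { source: "batch", table: "sku_features_mens" },       // where SKU features come from
  inventory: { source: "api", endpoint: "/inventory/available" },     // where available inventory comes from
  businessRules: [
    // executed in order; each rule filters the SKU list further
    { name: "size_match", inputs: ["client.sizes", "sku.size"] },
    { name: "color_preference", inputs: ["client.color_prefs", "sku.color"] },
  ],
};
```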
Inkling to Prototype
I had spent a week going back and forth on the idea. When given a big problem to solve, it is easy to cope by over-analyzing and over-planning. Instead of following our instincts, we often do what feels the most comfortable, which, in my case, is to think and plan. My instincts told me that these six apps were just one and that no one bothered to take the time to combine them because the apps worked so well for so long. Quals, like most apps at SF, was someone’s pet project and was given enough resources to scale but not enough to scale well.
This was a great way to test lots of ideas and explore the solution space, but it was a poor way to run a software company with long-term goals. I saw this in the repo that contained all six apps, the poor folder structure, the desultory commit messages, and the inscrutable deployments. But I did not act on my feelings because everyone around me was so brilliant; of course, they knew this problem and solution and must have dismissed it for some reason, right?
I wasted time trying to tear down my idea. I looked through each app and scrutinized the business rules, features, and output. It seemed even more plausible with every investigation. So, I asked my team, and guess what? Either people thought of it and then moved on to other things or didn’t even think about it. What is obvious to us is often only obvious AFTER we’ve done the work to gain insight — we easily forget all the work and luck that goes into gaining insight. Everyone gets inklings, but most people don’t follow them, for good reason, too. They have incentives and priorities that send them down different roads. When you have a feeling, tell someone and follow it.
It only took me a day to create and run the first configuration. I proved Quals could run on a configuration in a day. By the end of the week, the first app was running on parity with the non-config version. There were many bugs, of course. But once I had proven it worked, it did not matter how long it took to fix all the bugs. I had demonstrated a viable path forward.
The First Migration
The first migration was to distill the US Mens app into a configuration and use that configuration to create a data pipeline using SF’s internal pipeline creation and execution library Flight. The hardest part was figuring out what to include in the configuration.
The first step was getting the client features, which included the client’s warehouses needed to get the available inventory. Once this data was retrieved, I could use it to get a list of available SKUs and each SKU’s features. So I created a field in the configuration to point to the client features API. Then I realized US Mens production used batch features, not real-time (the real-time app was not getting any production traffic, which goes to show that looking at the code and running it is not enough; you have to look at the traffic). We wanted to move to real-time features, but how could I build the config to support that?
I spent too much time going back and forth on this; I should have written the thought down and saved it for later. But I wanted to be a good little engineer and future-proof my config by making it as flexible as possible. This temptation to generalize your config is a waste of time — you do not know when or if you will use these capabilities. Treat building the configuration like building any software; only build the features you know will be used. Future-proofing is gambling — you are making a bet that spending an extra day adding new features will pay more than finishing early or on time, which is generally a bad bet because you are delaying feedback. I decided to pull the batch client and SKU features only; I wanted to validate the config as quickly as possible.
The next field I added to the config was the business rules — these were app-specific filters that reduced the number of SKUs based on the SKU and client features. Each Quals app had a distinct set of business rules, covering everything from size and color to graphics. Some of these rules were similar but not identical. The input and output of these business rules had the same shape, though — each took a list of SKUs and SKU features and one or more client features, then returned a list of SKUs. Capturing this information involved creating a field in the config that was an array of objects. Each object specified a list of SKU and client features and a function that did the filtering (copied straight from the original Quals). I generalized these filter functions to work for all Quals apps — this was another early generalization that I later found out was wasted time. In Flight, the array of functions and inputs was turned into a pipeline of filter operators that pulled in the input data and then executed the filter function. The pipeline returned the last output from the final business rule. The result was that the end users could not distinguish between SKUs generated from the config and the original app. That was it, easy, right?
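Roughly, that rules array mapped onto a pipeline like the sketch below. The real version used Flight’s filter operators; the types and names here are simplified stand-ins.

```typescript
// Simplified stand-in for the Flight pipeline; types and names are illustrative.
type Sku = { id: string; features: Record<string, unknown> };
type ClientFeatures = Record<string, unknown>;

interface BusinessRule {
  name: string;
  // Takes the current SKU list and the client's features, returns the surviving SKUs.
  filter(skus: Sku[], client: ClientFeatures): Sku[];
}

// Each rule's output feeds the next; the final output is the recommendation set.
function runPipeline(rules: BusinessRule[], skus: Sku[], client: ClientFeatures): Sku[] {
  return rules.reduce((remaining, rule) => rule.filter(remaining, client), skus);
}
```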
It was easy until I ran a profiler and saw how much slower the Flight pipeline was than the series of loops and bit operations that the original version of Quals used. The pipeline was adding too much overhead — although it gave us a level of introspection, logging, and replayability that Quals never had before. For the first time in SF’s 9-year history, we could tell why a specific SKU was presented to stylists or clients. This was a happy side effect of the migration. I was trying to fix this unhealthy app I was given, but I unlocked a transformative tool for debugging and maintenance. Best of all, our non-technical partners could now use the plethora of UI tools that were built for Flight to do this introspection — self-service debugging!
But the latency was too large. I spent much time with my teammates (happily, we had the two engineers who built Flight available) finding ways to lower the overhead. This involved reducing the input size, optimizing the filter functions, and refactoring the Flight filter operator. In the end, we got it low enough to be operable, but the performance problem would reoccur in the last app I migrated.
From 1 to 2
I experienced a lot of churn migrating the second app, Kids, because of my poor mental model of what a config is and does. I repeatedly updated the config to support Kids. With each update, I beat myself up thinking that the abstraction I wrote was misguided and ill-defined. It was all needless pain. I treated the config as an interface and suffered for it.
A configuration is a manual for operating a machine; you will use different parts of the manual and ignore others, which is fine; each task the machine does is related but distinct. An interface is a blueprint for building the same thing without deviation in form or function, but the point of configs is to allow specific and obvious deviations in function. When you discard the interface mindset, you will focus on what matters in the config — readability and flexibility, not the quality of the abstraction or clean boundaries. Following this process, you will eventually find a natural abstraction that fits your needs without the churn and pain.
Kids was nothing like Mens. Kids, at the time, was the newest business line, so the business logic was still in flux. Furthermore, it completely departed from how SF treated gender, size, and other core preferences. I chose Kids because I knew it would force me to make the config flexible enough to handle the most eccentric business rules. If the config could handle Kids, it could handle any other app.
I could not copy and paste the filters from Mens onto Kids, so I pivoted. The config runner went from one config controlling the inputs and order of many shared generic filters to a config agnostic of filters. Now, all it needed were the files containing the filters. This change decoupled the configuration from the filters, and I did it at the right time, when an app needed distinct filters.
At first, I assumed that all business lines shared the same primary filters and rules, so coupling the config and filters made sense. This assumption was due to tunnel vision; I focused too much research on Mens and Womens (the most used apps), which shared a lot of logic. Migrating Kids showed me that there were more differences than similarities.
Now that the config could support unique rules, migrating Kids became much easier. It was a matter of copying over the filters and modifying them to be run in Flight. Yes, the amount of shared code decreased, but it was better than dealing with a separate app just for Kids. Now, I had two apps running on the configuration.
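A sketch of what that decoupling looks like, with hypothetical module paths and names; the point is that the config only points at filter modules instead of composing a fixed set of shared filters.

```typescript
// Hypothetical paths and names; illustrative of the decoupling, not the real code.
const kidsConfig = {
  businessLine: "kids",
  ruleModules: ["./rules/kids/size-range", "./rules/kids/age-appropriate"], // app-specific filter files
};

// The runner loads whatever filters the config points to, in order.
async function loadRules(config: { ruleModules: string[] }) {
  const loaded = await Promise.all(config.ruleModules.map((path) => import(path)));
  return loaded.map((mod) => mod.default); // each module default-exports a filter function
}
```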
Womens, The Biggest Migration
Migrating Womens was hard. Womens had the most traffic, business rules, SKUs, and features. Building up to it with Mens and Kids taught me to look for performance issues early and how to wrap unique business rules. At this point, the configuration was stable and flexible. Initially, most of the work was migration — not config design or transformer optimization like Mens and Kids — but that quickly changed into performance optimization.
Migrating all the Womens business rules in one go failed. All the regression tests failed, but it was worth a try. I learned that certain filters did most of the filtering. In Mens and Kids, the number of SKUs being operated on was relatively small, so I did not notice that specific business rules reduced the number of SKUs the most. With Womens, the pipeline was filtering hundreds of thousands of SKUs. I focused on these super-filters and moved them up the pipeline. This got me closer to passing the regressions (the diff went from tens of thousands of mismatched SKUs to a few hundred) and improved performance, since the bulk of the filters were now operating on a reduced set of SKUs.
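The reordering itself was done by hand after profiling, but the idea is simple enough to sketch: measure how many SKUs each filter lets through on a sample and run the most selective ones first. Everything below is illustrative.

```typescript
// Illustrative sketch; the actual reordering was done manually after profiling.
type Sku = { id: string; features: Record<string, unknown> };
type Rule = { name: string; filter(skus: Sku[], client: Record<string, unknown>): Sku[] };

function orderBySelectivity(rules: Rule[], sample: Sku[], client: Record<string, unknown>): Rule[] {
  // Fewer survivors on the sample means a more selective filter; running those first
  // lets the bulk of the pipeline operate on a much smaller SKU set.
  const survivors = new Map(
    rules.map((rule) => [rule, rule.filter(sample, client).length] as [Rule, number])
  );
  return [...rules].sort((a, b) => survivors.get(a)! - survivors.get(b)!);
}
```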
Then came the slow grind of migrating the other rules, figuring out what was wrong, fixing it, and checking performance. This took considerable time because simply copying the rules would cause serious performance issues. The entire project quickly became a question of performance, not feasibility or benefits. The work was optimization, not migration. I did not foresee this, nor did my teammates. I thought we might get a performance improvement moving from a series of for loops to a pipeline, but I was naive; the pipeline was just a better form of that series of loops.
By the end of my year there we managed to migrate Womens; I say we because I had reached my limits with the performance issues and would not have solved them without my teammates. Like the config, sometimes you just need to add a new variable to get over your problems and get to a place where you can see the big picture and find the right direction to take.
Deployment
Deployments became easier. We could now deploy Quals to each target app — Kids, Mens, and Womens — from the same repository and branch. The server that wrapped the config transformer detected the type of client the request was from and loaded the appropriate config. This made deployments easy — push master to the app that needs to be deployed.
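A sketch of that selection step, with a hypothetical header name and config map (the real server keyed off the client type in the request; none of the names below are the actual ones):

```typescript
// Hypothetical header name, config map, and response shape; illustrative only.
import { createServer } from "node:http";

const configs: Record<string, { businessLine: string }> = {
  mens: { businessLine: "mens" },
  kids: { businessLine: "kids" },
  womens: { businessLine: "womens" },
};

createServer((req, res) => {
  // Detect which business line the request is for and load that config.
  const line = (req.headers["x-business-line"] as string | undefined) ?? "mens";
  const config = configs[line];
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify({ businessLine: line, configFound: Boolean(config) }));
}).listen(3000);
```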
Pre-migration, we would need to go into the correct directory to find the right version of each app, switch to the correct branch, and deploy from there (multiple branches were deployed to production because some branches conflicted with each other). This was an error-prone system and easy to mess up in the middle of the night during an outage.
Impact
Did I accomplish all my goals? I left Quals, my team, and Stitch Fix in a better place than I found them a year earlier. Quals was healthy, there were no surprise outages, debugging was easy, and our business partners loved that they could see how their decisions impacted our customers.
Could I have done this better? Yes, without a doubt, but that’s a good thing; it means I challenged myself and did something worthwhile. Taking the most optimal route would mean avoiding risks that could pay off or taking too long to plan. Learning happens on the side roads. The beaten path is safe, but you won’t find anything worthwhile there.
Aside
I consciously delayed the UK migrations due to the nascency and size of those businesses. This turned out to be a sagacious decision. Less than a year after I left, SF announced the closure of the UK businesses. Sometimes procrastination pays off; call it an economy of action.