Think about the hardest project your team has ever worked on, or maybe the project you've avoided because you think it's impossible.

Sometimes you don't know yet what will be hard. If you're building a brand new product or feature, you probably can't predict what the limiting factor will be, and that's fine - you should just get something built and go from there. But sometimes there's an obviously hard part. Maybe people have made attempts before that revealed the hard part. Or your first version gets to production and you find out once it gets real usage.

Let's assume you know the hard part, and you don't just walk away from the whole thing, which is what many teams unfortunately do. What will your approach be given the existence of this particularly hard part? At a high level, you've got a couple options. There's one where you do the safe, incremental approach. You're trying to minimize the risk, and many teams choose this path. But there's probably some lurking reason why that approach isn't actually going to work. In practice, what you're realistically doing is avoiding the really hard part. So you have to ask: can we avoid the hard part and still succeed? If the answer is yes, then by all means avoid it. But if the answer is no, then the incremental path isn't actually safe. You're just delaying the inevitable failure.

There's another option where you attack the particularly hard part head-on, likely requiring you to take some big, risky swing. One of my favorite problem-solving approaches is to hold 1 thing constant and assess the rest. In this case, assume you take the big, risky swing and it actually works. Is the rest of the plan solid enough that the overall project is likely to succeed? If the answer is yes, and if you really want to tackle this project and succeed, you should take the big swing.

This sounds really simple, but a couple of important things need to be true in order for teams to do it. I'll dig in on those, but first, a couple of stories where I saw this play out.

A story: Stripe v2 Accounts

When I was at Stripe, I worked on Connect, where we enabled platforms to embed payments and financial services. We believed it would be incredibly powerful if a platform and its underlying users could unlock network benefits across Stripe's product suite. Maybe you have a user who you need to charge a subscription fee (customer), who also wants to accept payments (merchant), and to receive money (recipient). We can assume this user has a balance on Stripe because they've accepted a card payment for something they sold and they've received a transfer from another Stripe merchant. And they want to use that balance to pay your subscription fee. We knew this user was the same in all these different contexts, but our abstractions didn't enable a unified representation of that user. We had v1/accounts, v1/customers, and so on, so we couldn't really unlock this without a major new abstraction. This was the motivation behind v2 accounts - a single abstraction that could handle multiple configurations.

Many people and teams before my time at Stripe had envisioned this and had made attempts to tackle it. It was ambitious. And if you were to ask "what makes it challenging," the simple answer is that the entirety of Stripe's abstractions and business logic would need to learn how to work with a v2 account. Even if you bit off 1 part first, like replacing merchants with v2 accounts or customers with v2 accounts, the surface areas were each massive, because it's literally every API and every bit of business logic within Stripe's codebase that contemplates payments or billing. Plus, you only get the benefits we were after if you replace at least 2.

When I started working with the team, the plan of record was to stand up the new abstraction and data model, and have it live alongside the old ones, with two-way syncing. This way, most of Stripe's business logic didn't need to understand the new abstraction, and could keep using the old ones. Consider this the safe, incremental path that largely avoided solving the hard part. The lurking problem with this path was that the syncing would be brittle, and we'd have a scale problem real fast. If someone created a v2 account configured as both a merchant and a customer, we'd have 3 underlying models to keep in sync, each with copies of the same data (like their name and email).

The team went down a bunch of paths to try to make this viable, but the reality was that all of the options were avoiding the lurking problem that would likely turn out to be a showstopper. We got a small group of engineers together for an offsite where we discussed how to handle it. We found ourselves circling around another path, where we would attack the challenge head on. We could build an interface that would replace every read and write callsite and ultimately just write to our new data model. We called this "encapsulation." It had been suggested in the past, but everyone ignored it as a real solution because it came with an insanely high estimate in engineering months (engineering years, really). This time, we played it out for a second: what if we held constant that we could pull off encapsulation - would the overall project be successful? The answer was yes. But it was crazy.
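To make the idea concrete, here is a minimal sketch of what encapsulation could look like. It is not Stripe's code: every name in it (`UnifiedAccount`, `AccountStore`, `CustomerInterface`, `MerchantInterface`) is hypothetical, and the in-memory store stands in for a real data model. The point it illustrates is that legacy-shaped callsites go through an interface, and the interface reads and writes one record, so there are no copies to sync.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedAccount:
    """Hypothetical single record: shared data stored once, plus enabled configurations."""
    id: str
    name: str
    email: str
    configurations: set[str] = field(default_factory=set)


class AccountStore:
    """Hypothetical in-memory stand-in for the new data model."""

    def __init__(self) -> None:
        self._accounts: dict[str, UnifiedAccount] = {}

    def upsert(self, account: UnifiedAccount) -> None:
        self._accounts[account.id] = account

    def get(self, account_id: str) -> UnifiedAccount:
        return self._accounts[account_id]


class _ConfiguredInterface:
    """Shared behavior: every read and write goes to the one unified record."""
    configuration = ""

    def __init__(self, store: AccountStore) -> None:
        self._store = store

    def _load(self, account_id: str) -> UnifiedAccount:
        account = self._store.get(account_id)
        if self.configuration not in account.configurations:
            raise LookupError(f"{account_id} is not configured as {self.configuration}")
        return account

    def get_email(self, account_id: str) -> str:
        return self._load(account_id).email

    def set_email(self, account_id: str, email: str) -> None:
        self._load(account_id).email = email


class CustomerInterface(_ConfiguredInterface):
    """What a former customer callsite would call (assumed shape)."""
    configuration = "customer"


class MerchantInterface(_ConfiguredInterface):
    """What a former merchant callsite would call (assumed shape)."""
    configuration = "merchant"


store = AccountStore()
store.upsert(UnifiedAccount("acct_1", "Ada", "ada@old.example", {"customer", "merchant"}))
customers = CustomerInterface(store)
merchants = MerchantInterface(store)

# A write through the customer-shaped interface is immediately visible
# through the merchant-shaped one, because there is only one record.
customers.set_email("acct_1", "ada@new.example")
current_email = merchants.get_email("acct_1")
```

The sketch is small because the design is simple. The cost in the real project came from the number of callsites that had to move behind an interface like this, not from the interface itself.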

Lots of people avoid a problem like this at all costs, but other people gravitate towards the hardest things. Turns out we had a couple of engineers who got really fired up about an absolutely insane challenge like this, and we empowered them to solve it. We considered a few ways of running the migration, like federating the work out to every team that owned callsites, but we ended up keeping it centralized and opting for codemods, because wrangling that many teams with their own priorities was clearly going to be a silly mountain to climb. Throughout this, I worked with the team to push forward despite the very real possibility that encapsulation would become too long of a pole, because we really came to believe that it was the only viable path.

This was back in the early 2020s, before coding agents, so the estimate really was multiple engineering years of work. Despite a rollercoaster of challenges, the team pulled it off. I don't believe v2 accounts could have shipped without it.

Another story: Claude Managed Agents

At Anthropic, my team recently released Claude Managed Agents in beta, an API suite for running agents. When we built our first version, it wasn't obvious what the hard part would be, so we did what you should do with a brand new product: we built the simplest thing that could work. The API spawned a sandbox and booted up Claude Code inside of it, and everything about an agent's session lived in that container. We used it internally, gave early access to a small set of customers, and started collecting feedback.

The issues showed up pretty quickly. Latency wasn't great, because a container had to spin up before Claude could even start thinking. If the container died, the whole session died with it, which made reliability rough. And we didn't love that the code Claude wrote ran right next to MCP credentials.

For a while we treated these as separate problems, and we tried to patch each one. The team got better at nursing failing containers back to health and spent long stretches debugging stuck sessions where you couldn't tell whether the harness, the event stream, or the container had failed. Eventually we admitted to ourselves that every one of these problems traced back to the same choice: everything ran in a single container. There was our lurking problem. Fixing it meant separating the brain (the harness loop) from the hands (where code executes), and that meant building a real distributed system, which was significant complexity that the original design avoided.
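A toy sketch of that separation, to show why it helps with the reliability problem. This is an illustration under my own assumptions, not the actual architecture: `Harness`, `Sandbox`, `Session`, and `SandboxDied` are hypothetical names, and a real system would run these as separate services rather than objects in one process. The thing to notice is that session state lives with the brain, so losing the hands no longer loses the session.

```python
from dataclasses import dataclass, field


class SandboxDied(Exception):
    """Raised when the (hypothetical) execution environment is gone."""


class Sandbox:
    """The 'hands': runs code and holds no session state."""

    def __init__(self) -> None:
        self.alive = True

    def run(self, command: str) -> str:
        if not self.alive:
            raise SandboxDied(command)
        return f"ran: {command}"


@dataclass
class Session:
    """Session state kept with the 'brain', outside any sandbox."""
    events: list[str] = field(default_factory=list)


class Harness:
    """The 'brain': owns the loop and the session, and treats sandboxes as replaceable."""

    def __init__(self) -> None:
        self.session = Session()
        self.sandbox = Sandbox()
        self.sandboxes_started = 1

    def execute(self, command: str) -> str:
        try:
            result = self.sandbox.run(command)
        except SandboxDied:
            # Replace the hands and retry once; the session itself survives.
            self.sandbox = Sandbox()
            self.sandboxes_started += 1
            result = self.sandbox.run(command)
        self.session.events.append(result)
        return result


harness = Harness()
harness.execute("ls")
harness.sandbox.alive = False  # simulate the container dying mid-session
harness.execute("pytest")
```

The retry here is deliberately naive. In practice a replacement sandbox would also need its filesystem state restored, and not every command is safe to run twice, which is part of the distributed-systems complexity the original design avoided.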

We were close to a public beta at this point, and delaying a launch to rebuild an architecture is a painful call. But we asked ourselves the same thing: could we go to beta on the single container architecture and still succeed? We knew we couldn't. So we delayed and did a full rearchitecture. When we brought early access customers back in, the credential and reliability problems were gone and time to first token had dropped significantly. We took this system to beta.

Maybe we could have predicted that reliability at scale would be the hard part that we'd have to solve in order to succeed. But for a brand new product, iterating to the answer was better than letting perfect be the enemy of good out of the gate. The actual mistake would have been going to beta anyway after we figured it out.

Enabling a team to take the big swing

The reason teams avoid the hard path is pretty rational. The hard path comes with a massive, low-confidence estimate, and a team needs to take on huge accountability and risk to go down it. The path that avoids it has lurking problems that show up later, in ways that can ultimately lead the whole thing to fail. But that path sounds safe and therefore quite tempting. So you go down the safe path, sometimes for years, even while a bunch of people believe it will fail but push forward anyway.

If you want to be a team that takes the big, risky swing, 2 things have to be true.

First, you need people who can get really fired up about the challenge. You don't need consensus, and you definitely don't need everyone to believe it will work. Most people will look at the hard part and assume it's impossible, and that's fine. You need the ones who see the most interesting challenge they've maybe ever gotten to tackle. Give those people the problem and let them take a lot of ownership over how to solve it.

Second, as a leader, you need to be willing to take accountability and cover the team to go after it. If it doesn't work out, it's on you, and there's a host of consequences that could come with that. Taking on this risk requires you to personally feel 2 things: first, you actually really want this project to be successful, and second, you actually believe this big, risky swing is necessary.

The line is shifting - some hard things are becoming easier than they used to be. If my team at Stripe had been writing code with Claude Fable, we might not have found encapsulation so daunting. What's really happening is that AI is leaving all of us with the truly hardest parts. So really, the crux-y question hasn't changed: can you avoid the hard part and still succeed? If the answer is no, the safe path is just pushing your failure out. If you really want to succeed, you have to take the big swing.