The build/operate trade-off in engineering cultures

January 19, 2024

I’ll be the first to admit that the most fun part about being a software engineer is the act of writing code. I can’t think of any other trade where the feedback loop between trying something new (a few keystrokes) and seeing if it worked (yet another keystroke) is so fast. Each compile-run cycle is its own dopamine hit. And there’s nothing wrong with that! The joy that comes from the instant feedback of programming is what attracted my quick-to-wander mind to software engineering in the first place.

The problem comes when you don’t reconcile the lightning-fast coding feedback loop with the much longer software lifecycle feedback loop. As you quickly learn, there’s so much more to do after you get something working on your computer. You also have to make sure your work builds and deploys on a remote machine, and only then does the real work begin: the real feedback loop emerges. If you instrumented your app correctly, you’ll start to see the world in all of its complexity. Maybe your work crashes in strange and unexpected ways. Maybe it doesn’t perform well for everyone, or feedback starts coming in that users are confused.

I’ve been part of many different engineering cultures over the years, and one thing that set them apart from each other was how they prioritized building vs. operating software. In one corner you have the “move-fast-and-break-things” engineering cultures. They build features and move on. They throw most of the monitoring and support over the fence to customer support, DevOps, etc. They hear of problems mostly through customer reports.

In the other corner we have operationally-focused engineering cultures. During my time at Quidsi/Amazon we called this “Operational Excellence”. A feature was not “done” until it had robust instrumentation (logs, alarms, dashboards), all in the service of seeing what was and wasn’t working in production so we could properly support what we built. Having all the instrumentation in the world is not enough, of course. Following through on what it told us was the most important part, and that follow-through was codified and glorified in the engineering culture. These engineering orgs hear of problems first from their tooling.

Build-focused engineering cultures can move fast, but the output can be disappointing: efficient but not effective. Operationally-focused engineering cultures are rigorous and correct but can be slow and stifling. It’s a trade-off because the time you spend on the shorter programming feedback loop is time you’re not spending on the longer software lifecycle feedback loop.

If you don’t do anything, the default startup engineering culture leans heavily towards building at the expense of operating. I think that’s wrong. At the same time, I don’t think the right answer for startups is to start with a heavy operationally-focused engineering culture (there’s nothing to operate yet!). But if you’re not careful and intentional, you’ll never get to a healthy balance.

This is one of the things I think about the most at Meadow: what specific, intentional things can we do to move us to the right build/operate equilibrium at this stage of the company? We built a daily team ritual of monitoring and triaging production issues around Sentry. We began a prod-support engineering rotation on the same day we launched Meadow Pay. We’ve also made strategic early investments (much, much earlier than you would expect) in automated end-to-end testing and purpose-built alarming systems. Yes, those projects delivered the value we expected, but the bulk of the value came from the nudge they gave our early engineering culture along the build/operate spectrum.