I started working at Netflix in February 2025. Like most new engineers in a new role, I was given an onboarding task: something simple and lightweight that you can finish in a relatively short amount of time, designed to get you familiar with the codebase and give you a quick win. My onboarding task fit all those criteria. It was well-scoped and the context was clear.

I dove in, got my development environment set up, and made the code change. I wrote unit tests and integration tests to verify everything worked as expected. All green. Everything looked good.

While working on the task, I noticed something odd. The API I was modifying took two arguments that appeared to be duplicates. It looked like redundant code that could be cleaned up. So I thought, “Why not fix this while I’m here? Kill two birds with one stone and make a good impression.” So I did. All the existing tests still passed after I made this optimization, and everything looked good.

I opened my first pull request, and honestly, I was proud of it. The PR was reviewed and approved by everyone on the team who had context on the service. I merged it after approval and the change went live. As someone who loves building things that reach people in real time, I was thrilled. This wasn’t just any service; it was part of Netflix’s streaming infrastructure. My code was out there, running, affecting millions of users.

But that excitement didn’t last long. As the change rolled out to different regions, we got paged. Some users were experiencing streaming issues. The on-call engineer investigated and traced the problem back to my pull request, specifically to that “optimization” I had made. My PR was immediately rolled back, and I was notified. My heart sank.

After my own investigation, I discovered that the “redundant” argument wasn’t a mistake. It was there by design.
An upstream service depended on that exact structure, and when I removed it, I broke their integration, which in turn broke streaming for some users. No one on my team knew that the upstream team depended on that structure, which is why the PR had sailed through review.

I had to present the incident to my team and later to my org. That was my first presentation at Netflix, by the way. Talk about an introduction. During the presentation, I focused on what I could have done differently: how I shouldn’t have assumed the code was redundant, how I should have asked more questions, and so on. I showed up taking full responsibility and blame for everything that had happened.

Right after that first presentation, I got feedback urging me to present future incidents differently. Taking responsibility is expected, but the goal of the review was not to assign blame. Multiple people reviewed the PR, so this wasn’t just on me. What mattered more was asking: What can we put in place to prevent this from happening again? How can we work better with upstream teams to understand how they’re using our service?

That shift in perspective stuck with me. I walked away with several takeaways that have shaped how I approach development ever since:

Keep changes small. One PR should do one thing. If you’re making a large change, break it into small chunks. This limits the blast radius of potential issues, makes rollbacks easier, and makes reviews more effective. It’s much easier for teammates to carefully review 3-4 files than 54.

Don’t make assumptions. That “obviously redundant” code might be there for a reason you don’t know about. Always ask. Reach out to people who might have context.

Mistakes will still happen. You can follow all the best practices and still cause incidents. That’s software development.
What matters is that you take responsibility, learn from it, and work to prevent repeating it.

My first PR at Netflix didn’t go how I expected. But looking back, I’m grateful for how my team and I handled it. The incident taught me more in two weeks than I could have learned from months of smooth sailing.

If you’ve ever shipped something that broke production, you’re not alone. We’ve all been there. The key is to grow from it and use it as a learning opportunity, not just for you but for your whole team and company.

Uma