I Fix Broken Things
More than 20 years ago, I got hired at my first “large” company. It was a financial institution with about 350 employees, and an IT department of about 30 people split between operations and development.
On my first day, the guy I was replacing came in just to show me the most important part of my new role: keep a critical risk assessment tool online.
Now, you have to remember this was the early 2000s, in a company with very little technical expertise. I was still a young professional, and some of the SRE principles I breathe like air today were not even a thing.
That said, the gentleman showed me the setup. The risk assessment tool ran on a RISC (Compaq Alpha) server running Red Hat Linux and JBoss. For some reason (which I figured out months later and fixed), the system crashed multiple times a day. My main duty was to keep my ears open and, as soon as someone yelled that it was down, run to the datacenter, switch the KVM to the right server, log in as root, and run a short sequence of three or four commands that would bring the system back up. Simple enough.
But on my third day at work, I got annoyed. I mean, really? Maybe I was just too lazy and didn’t want to get out of my chair. So I created a regular user I could use to SSH into the server and then switch to root. I also created a tiny shell script with those three or four commands that fixed the problem. No rocket science.
The other thing is that we had an open office, and the IT crew sat very close to the risk assessment team. I was especially close to some of the power users. So I kept my ears open for the early warning signs. As soon as they started chatting (“Hey, is the system working for you?”), I ran the script (I never closed my session).
Just out of annoyance, I reduced the MTTR (Mean Time to Restore) from 5–10 minutes to about 30 seconds. Sometimes I fixed it before most people even noticed it. Of course, back in the day I had no idea what MTTR was, and no idea that I should have bragged about it to my boss.
It took me years, many jobs, and helping different organizations improve before I realized that my real skill is not Linux, Python, shell scripting, network architecture, or AI. What I am really good at is finding broken things and fixing them. When I was younger and less experienced, that was mostly technical. As I matured and moved into leadership roles, it became a mix of people, processes, and tools.
It is still hard for me to explain to non-technical people what I do for a living, but I think it is fair to say I try to keep problems from happening. And when they do happen, I try to keep them from coming back. And in the middle of all that, I’m always looking for ways to make things better for everybody.
Over the years I’ve learned that my best work tends to show up in the unglamorous places: the systems that are always “almost working,” the alerts everyone got used to, the recurring issues nobody has time to chase. I enjoy making sense of that mess, tightening the feedback loops, and turning it into something stable and predictable. It’s the same instinct I had back in the early 2000s—just applied at a different scale.