The Wizards of the Stack
I was about 23, working at my first “real” job in IT. By then, I already had about five years of experience, but I had mostly been doing freelance work or working under the table, as was (and still is) common in my native country, Brazil.
I was really proud of that job. A bank had hired me to work on their server infrastructure, networking, and security. I was a college dropout who still pretended to attend classes once in a while at night. More often than not, I was moonlighting as a consultant instead.
It was a very small private bank with only a few hundred very wealthy customers, both individuals and companies. Not exactly a retail banking empire, and the IT department reflected that.
The online banking platform was surprisingly advanced. In many ways, it still beats what some major Canadian banks offer today. The reasons why could probably fill a few posts on their own. But despite all that, we had a single server running the entire online banking system.
One physical server.
In a small server room.
Naturally, I cared about that server the way someone cares about a pet. I was young, inexperienced, and figuring things out as I went. The best operational strategy I could come up with was to log in every day, run diagnostics, update software, restart services, and run Tripwire and rootkit scanners religiously.
I had learned early in my IT career that one of the best ways to avoid being hacked was to stay current with security updates and avoid known vulnerabilities.
So, naturally, on Friday, December 19, 2003, the day after Linux kernel 2.6.0 was released, I decided to upgrade our production online banking server.
Did I mention I was young and naive?
It went exactly as well as you would expect.
My custom-configured, custom-built kernel panicked and refused to boot. To make matters worse, I somehow broke the LILO configuration, and the old 2.4.x kernel would not boot either.
Everything that happened next is a blur. After enough panic and no solution in sight, I grabbed the phone and called a friend. I don’t even know whether I made sense during that conversation, but I explained my mistake and read him the panic message.
He hung up.
About thirty minutes later, a patch arrived in my inbox.
I booted the server from a live CD, installed the newly compiled kernel, and held my breath.
It worked.
Everything came back online.
To this day, I am grateful to him, and I still think he is one of the smartest people I know.
At the time, it felt like magic.
Years later, I would realize it wasn’t.
A few years after that, I was working at IBM in the NetOps group. The team had built a highly customized firewall layer with extremely granular access controls that allowed IBM staff to access customer infrastructure. Only a handful of people had root access, and only the architect who designed the system truly understood how all the pieces fit together end-to-end.
I wasn’t there when it happened, but after one particular weekend, the story spread quickly.
A scheduled maintenance window had gone badly wrong. Systems failed spectacularly, and the engineers performing the work had very little understanding of how to restore the environment.
As a last resort, they called the architect.
He was at a party.
Possibly very drunk.
Legend has it that he sat down, grabbed a napkin, and wrote a shell script by hand. He then handed it to another engineer who was apparently sober enough to read it. That engineer dictated the script over the phone while someone else typed.
The script ran.
Everything came back online.
Again, it looked like wizardry.
Again, it wasn’t.
Another fifteen years passed.
This time I was working at Google.
I had finally made it. I was officially the least smart person in the building.
An issue had been making the rounds for months. One of our partners had reported it, and my team kept cycling through the same troubleshooting process: metrics, packet captures, dumps, logs, evidence… everything.
Nothing explained what was happening.
I reviewed the investigation and all the work that had gone into it. To be honest, I could barely understand what they were talking about.
The people working on it were not random engineers. They were seasoned sysadmins and SREs who had spent years at Google.
And they were completely stumped.
We had an escalation path to a software engineering team that worked closely with us. One of their engineers was a tcpdump savant. You could hand him a packet capture and, without using any tools, he would tell you exactly what was happening. Better yet, he would explain it so everyone else could understand.
He spent weeks on the problem.
No progress.
He couldn’t figure it out either.
But this was Google, and he knew someone else.
As it happened, one of the Linux kernel developers responsible for portions of the TCP/IP stack worked at Google.
That kernel developer reviewed the issue.
A few minutes later, he identified the problem.
Linux was mishandling a specific class of network payloads.
He wrote a fix that same day, rolled it out across thousands of servers, and shortly afterward merged the change into the Linux kernel mainline.
To everyone watching, it looked like magic.
For the third time, it wasn’t.
None of these people was magical or enlightened.
They were simply operating within their area of expertise.
My friend understood C and Linux kernel internals at a level I did not.
The IBM architect understood every detail of a system he had personally designed.
The Google kernel developer owned part of the code that was causing the problem.
To everyone else, they looked like wizards.
To them, they were just doing their jobs.
One lesson I learned over the years is that expertise is fractal.
No matter how senior you become, eventually you will encounter a problem that belongs to someone else’s layer of the stack.
I’ve had many one-on-ones with engineers who were frustrated with their own performance and understanding. One of them once told me she felt stuck, as if she were constantly running into problems she couldn’t easily solve.
Here’s what I usually tell people.
Once you become proficient at something, you stop consciously noticing how much you know. You move through familiar problems effortlessly. You solve issues, or prevent them entirely, with very little visible effort.
Because of that, you don’t always recognize that you have become one of the people from the stories above.
When you’re good at something, it feels easy.
And because it feels easy, you tend to undervalue it.
What you do notice are the things you cannot solve.
But those are exactly the problems you haven’t mastered yet.
As you become more capable, people send you harder and harder problems. The easy ones disappear from view because you’ve already absorbed them into your muscle memory.
Ironically, feeling stuck is often evidence of progress.
Beginners struggle with beginner problems.
Experts solve those so easily that they stop noticing them.
The only problems left are the ones sitting at the edge of their understanding.
Every engineer in the stories above eventually found a problem they couldn’t solve.
But by then, each of them had already become the person everyone else called when they were stuck.
If your work regularly leaves you feeling out of your depth, it may not be a sign that you’re falling behind.
It may be a sign that you’ve already mastered everything beneath you.
After all, back in 2003, I was the panicked engineer staring at a dead production server.
To someone else, I was probably the wizard.