Software Engineer Failures
Preface to a new series on software disasters
I received a critical comment that read: “Be concise. Don’t waste my time. You consolidate all the shit and deliver what your title says.” I found the tone unnecessary (“don’t waste my time? I’ll waste as much of your time as I want! Here, watch this 3-hour video of a stick of butter at room temperature!”), but going forward I plan to include summary sections in my introductions.
This post will be:
- An explanation of a new series I will start called “Software Engineer Failures,” and my motivation for starting it
- That’s all. It was originally supposed to detail the Therac-25 by drawing from this paper and a publicly available UC Berkeley lecture, but it was getting so long that I decided to split it
I think the critical comment was also a reaction to clickbait in general, so let me unpack that. On one hand, clickbait is a real problem; on the other, I do not see how “Making Sense of Tech Layoffs” is a clickbait title, or how the rest of my writing failed to deliver on something that vague.
My writing has been characterized in many ways, but I don’t know if anyone has ever called it “concise.” That sentence alone is a perfect example. I like to think of it as part of my style, but this commenter is presumably not the first person annoyed by the jokes, random tangents, and not-always-necessary YouTube clips.
Software Engineer Failures
One of two things will probably happen from here: either we enter a phase where LeetCode no longer matters, because everyone relies on signals like open source contributions, portfolios, and referrals…or we enter the “Super LeetCode Era,” a period where everyone has to pass three online coding assessments, six technical interviews, and a psych evaluation to get anywhere near a company like Apple or Microsoft. Why? Because many qualified people are being laid off from big tech companies. They are competitive, they are experienced, and they have already proven themselves “LeetCode-ready.”
One thing LeetCode-style interviews miss is system design, which is arguably much more relevant. I thought it would be nice to write about actual systems, not the toy ones we pretend to create in system design interviews, but that got me thinking about something else…
…failures.
If you become a software engineer because you solved the Compressed String LeetCode question in 45 minutes, well done. You handled pressure, you demonstrated competence, and you worked through a problem alongside someone you will now work with professionally. That accomplishment is legitimate.
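(For the curious: assuming the question is the usual run-length flavor of “string compression,” a solution looks roughly like the sketch below. The function name and the exact rules are my assumptions; details vary between versions of the problem.)

```python
# A sketch of the run-length "string compression" exercise:
# "aaabcc" -> "a3bc2". Assuming this is the flavor of problem
# meant; the exact rules vary between versions of the question.
def compress(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # extend the run of identical characters
        run = j - i
        out.append(s[i] if run == 1 else f"{s[i]}{run}")
        i = j
    return "".join(out)

assert compress("aaabcc") == "a3bc2"
assert compress("abc") == "abc"
```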
But in the actual field, the code you write has a certain “gravity.”
Levels Of Failure
Because of the failures of the Therac-25, people died. To put it as simply as possible, it was a radiation therapy machine that delivered far too much radiation. The paper linked above indicates that there was a race condition, but what I did not realize until reading it is that there were other problems. The UI was unintuitive and practically invited disaster: minor, cryptic error messages looked exactly the same as catastrophic warnings. The machine was an “upgrade” in the sense that the software was new and supposedly safer, but the hardware removed a fail-safe mechanism that had saved lives in previous models.
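To make the race condition concrete, here is a minimal, hypothetical sketch in Python. This is not the Therac-25’s actual code (that was assembly, and it has never been published); the variable names and timings are invented. The bug pattern is the one the paper describes: the setup task checks a “data entry complete” flag once, then acts on data the operator can still edit.

```python
import threading
import time

# Hypothetical sketch of a check-then-act race condition, loosely
# modeled on the failure mode the paper describes. This is NOT the
# Therac-25's actual code; names and timings are invented.

beam_mode = "xray"       # shared state: what the operator typed
entry_complete = False   # shared flag: "the operator is done"

def operator():
    global beam_mode, entry_complete
    entry_complete = True
    time.sleep(0.001)        # the operator spots a mistake...
    beam_mode = "electron"   # ...and edits the entry "in time"

def setup_beam():
    while not entry_complete:    # BUG: checks the flag once,
        time.sleep(0.0005)       # never re-validates the data
    mode = beam_mode             # may capture the stale value
    time.sleep(0.002)            # slow hardware setup
    print(f"firing as {mode!r}; operator wanted {beam_mode!r}")

setup = threading.Thread(target=setup_beam)
setup.start()
threading.Thread(target=operator).start()
setup.join()
```

Run it and the setup task will usually fire with the stale value; the fix is to lock the shared data or re-validate it after any edit, which is exactly the discipline the Therac-25 software lacked.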
One reason the story sticks in so many minds is the horrific outcome. Patients experienced a burning sensation, in some cases followed by loss of limb function, nausea, disorientation, and death. Another reason is that it makes a case for better software development practices: the software was written by a single developer who apparently produced minimal documentation and was never required to subject his code to adequate testing.
The story is reminiscent of the much more recent Boeing 737 MAX incidents.
The two stories are very different, but both involve:
- Software that did not work as intended, operated by people who were never told everything they needed to know about it
- Companies that promptly pinned the blame on the operators/pilots until they were forced to admit their own mistakes
- Actual deaths
But the actual motivation for writing this was the video below:
This is a video about a company that did not use microservices properly. They likely lost millions of dollars, but it is not as newsworthy a story. That said, stories like this can be more useful from a technical standpoint: if we treat every failure as worth studying, we can focus on the ones relevant to our respective niches…even when the result is simply that an important website was down for a few hours and a lot of users were angry about it.
Closing Thoughts
As I write this, I am starting to realize exactly what the critical comment was getting at. The next post will use the same picture but jump right into the story, the results, and the lessons learned. Unfortunately, any Therac-25 “code” you will see, including the sketch above, is pseudo-code or reconstruction. To the best of my knowledge, no one has published the actual source code since the incidents, and the paper derived much of its material from depositions.
I think junior developers often join companies with one of two very different mindsets:
- They think they know everything, prompting them to try to fix very large and very complex systems before they fully understand them
- They think they know very little, but also that the companies they are joining are run by highly competent, extremely intelligent software engineers who do everything they can to ensure the company produces bulletproof code
As a student reading about the Therac-25 incidents, I found the case easy to dismiss: hire more developers, keep the hardware fail-safes, accept responsibility, and write good test cases.
But in the microservices story, it sounds like everyone failed. Through some massive diffusion of responsibility, they probably believed the system made sense and would work because so many other people believed the system made sense and would work. When they finally tested it after more than a year of development and it completely failed, they were probably shocked. Each component worked on its own, but the system as a whole did not.
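Here is a hypothetical sketch of how that happens, since the video does not share any code (the services, functions, and field names below are all invented): two services whose unit tests both pass, but whose shared contract silently disagrees.

```python
# Hypothetical illustration of "every component works, the system
# does not." The services, functions, and field names are invented;
# they are not from the company in the video.

# Team A's billing service emits events keyed by "user_id".
def make_invoice_event(user_id: int, amount: float) -> dict:
    return {"user_id": user_id, "amount": amount}

# Team B's notification service, built in parallel, expects "userId".
def notify(event: dict) -> str:
    return f"Emailing user {event['userId']} about ${event['amount']}"

# Each team's unit tests pass, because each team tests against its
# own assumption about the contract:
assert make_invoice_event(42, 9.99)["user_id"] == 42
assert notify({"userId": 42, "amount": 9.99}).startswith("Emailing")

# Integration, attempted for the first time a year in:
notify(make_invoice_event(42, 9.99))  # KeyError: 'userId'
```

Multiply that by dozens of services and a year of development without end-to-end tests, and “everything works on its own” stops being reassuring.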
I think failure is a good topic to study.