Software Engineer Failures — Therac-25

A radiation therapy machine that led to six major accidents

Curt Corginia
Mar 3, 2023
Figure: simplified pseudocode characterizing Datent, the Therac-25 subroutine for data entry. The software failed to detect when the operator finished editing, and it also set up a potential race condition.
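The caption above describes a shared-state bug: data entry and treatment setup run concurrently, and a late edit can be missed. Here is a minimal sketch of that failure mode in Python; all of the names and values below are invented for illustration, and this is not the original logic (which was PDP-11 assembly we have never seen):

```python
# Hypothetical sketch of a "lost edit" bug like the one the Datent
# caption describes: setup snapshots the treatment parameters once,
# so an operator edit arriving during the lengthy magnet setup is
# silently ignored.

class DatentSketch:
    def __init__(self):
        self.params = {"mode": "x-ray", "energy_mev": 25}
        self.snapshot = None

    def begin_setup(self):
        # BUG: parameters are copied exactly once when setup starts;
        # nothing re-reads them if the operator keeps editing.
        self.snapshot = dict(self.params)

    def operator_edit(self, key, value):
        self.params[key] = value  # the screen shows the new value...

    def treatment_parameters(self):
        # ...but the machine acts on the stale snapshot
        return self.snapshot

d = DatentSketch()
d.begin_setup()                      # magnet setup begins (~8 s on the real machine)
d.operator_edit("mode", "electron")  # edit arrives mid-setup
print(d.treatment_parameters()["mode"])  # prints "x-ray", not "electron"
```

The screen and the machine disagree, and nothing tells the operator: that mismatch between displayed state and acted-upon state is the heart of this class of bug.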

The failures of the Therac-25 between 1985 and 1987 were so horrific that they became a standard case study in computer science ethics courses. These incidents of massive radiation overdose include:

  • A patient who experienced burning and spasms, then the removal of her breast, followed by complete loss of function in her arm and shoulder
  • A patient who suffered nausea and lost the ability to speak
  • A patient who felt as if he had been burned, grew agitated, and repeatedly asked, “What happened to me?” before dying three weeks later

What this post will be: A brief overview of the problems with the Therac-25, and what this can teach us today about proper engineering.

What this post will not be: A comprehensive look at how the Therac-25 worked and how it could have worked better. You can already find that in a 49-page paper here. Although the authors were never given access to the Therac-25’s actual source code, they had enough information to formulate their own design diagrams and get a rough idea of how the software worked. The software was developed by a single person in PDP-11 assembly language.

The Problem, In Brief

The Therac-25 actually had multiple problems. Its UI/UX was poor: error messages were cryptic, and catastrophic errors did not stand out from routine ones. It had race conditions and memory mismanagement (explained in more detail on page 28 of Leveson’s paper), and it replaced functional hardware with flawed software. To elaborate on that last point: on earlier models, hardware interlocks ensured that the turntable was in the correct position; the manufacturer chose not to replicate this hardware on the Therac-25, relying instead on software.
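One concrete example of the memory mismanagement, as Leveson describes it: a one-byte shared variable called Class3 was incremented on every pass through the setup routine, and a safety check ran only while Class3 was nonzero. The sketch below is Python, not the original assembly, but the arithmetic is the same:

```python
# Sketch of the one-byte counter rollover from Leveson's paper:
# Class3 is incremented each setup pass; the safety check runs only
# when Class3 != 0. Every 256th pass the byte wraps to zero and the
# check is silently skipped.

def setup_pass(class3: int) -> tuple[int, bool]:
    class3 = (class3 + 1) & 0xFF     # 8-bit increment: 255 wraps to 0
    check_performed = class3 != 0    # zero is misread as "no check needed"
    return class3, check_performed

class3 = 0
skipped = 0
for _ in range(512):                 # simulate 512 setup passes
    class3, checked = setup_pass(class3)
    if not checked:
        skipped += 1
print(skipped)  # 2 -- the safety check is skipped twice in 512 passes
```

If an operator happened to act at exactly the wrong moment relative to that wraparound, the machine proceeded without the check. The fix the paper describes is the obvious one: set the flag to a constant rather than incrementing it.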

This next part is commentary, but I do not find it controversial in hindsight: the disaster was made worse by a company that failed to respond to criticism. Citing the same paper again: the company was dismissive of incident reports, failed to comply with a request to install interlocks after incidents had already occurred, and insisted its product was not to blame until more incidents came to light. To quote a UC Berkeley professor: be extremely wary whenever a company with millions of dollars behind it tries to pin the blame on operators.

The problem was also exacerbated by the fact that the patients already had cancer. When hospitals attempted to investigate what was happening to their patients, it was difficult to discern what was caused by the Therac-25, and what problems were related to the cancer itself. The software bug(s) had actually existed in the predecessor, the Therac-20, but in that case the incidents were corrected by hardware and all that occurred were broken fuses.

Possible Lessons (provided by the UC Berkeley professor above)

  • Software documentation is important
  • Simple designs are valuable
  • One risk in software is that it is relatively cheap to produce (compared to hardware), but that there is virtually no limit to how complicated it can become. Real systems can get unbelievably big
  • Introduce redundancy; have code checked by multiple people to avoid finger-pointing
  • Building software should be thought of just like the process of engineering bridges (his point, not mine; I respectfully disagree, because we can just import the bridges)

Lastly, he said to debug by subtraction, not addition: find the portion of code that is wrong and remove it. Don’t add a special band-aid to paper over a problem whose cause you have not bothered to learn.

By the way, this is the same professor who went viral.

What a legend.

Further Reading

Leveson and Turner, “An Investigation of the Therac-25 Accidents”

Closing Thoughts

Hindsight is 20/20, but this could have been prevented by:

  • Fail-safe hardware
  • Better documentation
  • Code reviews
  • Rigorous testing
  • Better UI/UX
  • A more responsible company

I have been getting low on material, so this blog may go off the deep end and diverge into coding poetry before it gets back to this series and the LeetCode one.


Curt Corginia

Founder, CEO, CTO, COO, and janitor at a company I made up called CORGICorporation