How Likely Is That To Kill Anyone?

IT teams newly responsible for OT security are often appalled by the results of an initial vulnerability assessment. “Patch everything! Patch it now!” is often the directive issued to engineering teams. The correct response to such a directive is “How likely is that to kill anyone?” Engineering teams cannot proceed with any change to a system until they have a clear understanding of the answer. And the answer is almost never “zero likelihood.”
Andrew Ginter

Patching or applying security updates to some industrial networks is very hard. Why? Consider a typical refinery. The entire site goes down once every three years for a retrofit in which every piece of physical equipment, large and small, is inspected. What is worn out is replaced, and what is old may be upgraded. Necessary new systems and upgrades are installed. Control system computers and devices are similarly examined and replaced or upgraded. The entire process incurs enormous change. Engineering teams plan for, study, prototype, analyze and test every change for safety and reliability, sometimes for up to six years prior to an outage. There are frequently two engineering upgrade teams working in parallel, staggering their results into alternating three-year production outages, because there is that much work and analysis involved in these outages.

Every Change Is a Risk

But when everything is re-assembled, do we simply turn everything back on? Well, no. Despite up to six years of analysis, we may have missed something. Every change is a risk, and we’ve changed everything. So, what do we do? Typically, all vacations are cancelled. All vendor representatives and services contractors are summoned to the site. Everyone starts putting in 12-hour days, and the plant is started and brought up to 5% of capacity. Every technician, vendor and engineer is walking around the site, looking at things, listening to them, feeling them if it’s safe to touch them, and sometimes even sniffing at them. The plant operators, their supervisors and the engineers are clicking through the HMI screens, looking at every bit of each screen to see whether both the plant and the screens are working as expected. The cyber people are looking at memory usage, network communications, and logs.

Nobody and nothing is perfect. We find problems, and we fix them. We bring the plant up to 25% of capacity. To 50%. And eventually to 100% of capacity. It’s been two weeks. Everyone is exhausted. Most of us haven’t seen our families in all that time, and still, we look for problems. We find fewer and fewer new problems. Each problem is triaged. Low-priority problems are documented and handed off to the team preparing for the next outage, three years from now. We start to stand down. At three weeks, the plant is at full capacity, the vendors have all gone home and we are back to a normal staff. Success!

Patching the System

But wait – on the following Tuesday, Microsoft issues a Windows security update with 17 fixes in it. Do we apply that update? If we do, will we introduce new problems that impact safe operations? Will we introduce a problem that trips the plant? How can we know? We do not have the source code for the changes, and even if we did, we most likely could not find people able to analyze that much code with the degree of engineering confidence that we need. If we cannot analyze the code, must we shut down again, apply the patches, bring everyone back and start the commissioning process all over again?

Many industrial sites delay security updates, installing them only once they are confident that the update will not impair operations unacceptably. Sometimes it takes months of testing on a test bed to prove that an update is safe. Sometimes the patch is simply deferred until the next outage, three years away.

Engineering Change Control vs. Constant, Aggressive Change

Every change is a risk, and engineering change control (ECC) is the discipline that engineering teams use to control that risk. Equipment that a vendor has certified for safety, at a cost of up to half a million dollars, cannot be used with security updates until the vendor re-certifies the equipment on the changed operating system. Other equipment is not updated until the engineering team is satisfied with the risk, and even then, the teams tend to apply the update to the least vital equipment first to see whether the patch causes problems. Then they apply it to the machines that serve as backups for vital redundant equipment. Then they switch over to the updated backups. If there are any problems, they switch back to the unpatched primaries, and so on.
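By way of illustration only (this is not any particular site’s tooling), the staged approach just described can be sketched as a sequence of deployment rings with an explicit rollback path. The ring names and the apply_patch, health_check and roll_back helpers below are hypothetical placeholders for steps that, in practice, are largely manual engineering activities:

```python
# Hypothetical sketch of an ECC-style staged patch rollout.
# Ring names and helper functions are illustrative placeholders, not real tooling.

ROLLOUT_RINGS = [
    "least vital equipment",                                  # try the patch here first
    "backups of vital redundant equipment",                   # then the standby units
    "vital primaries (after switching to patched backups)",   # last, once backups prove out
]

def apply_patch(ring: str) -> None:
    """Placeholder: apply the vendor-approved update to every host in the ring."""
    print(f"Applying update to: {ring}")

def health_check(ring: str) -> bool:
    """Placeholder: a soak period plus operator and engineering review of the ring."""
    print(f"Observing and reviewing: {ring}")
    return True  # in practice, a human sign-off after days or weeks of observation

def roll_back(ring: str) -> None:
    """Placeholder: switch back to the unpatched units and halt the rollout."""
    print(f"Rolling back: {ring}")

def staged_rollout() -> bool:
    for ring in ROLLOUT_RINGS:
        apply_patch(ring)
        if not health_check(ring):
            roll_back(ring)
            return False   # defer the remaining rings to a later maintenance window
    return True            # update fully deployed

if __name__ == "__main__":
    staged_rollout()
```

The point of the sketch is the ordering and the rollback path, not the automation: at most sites each of these steps is a scheduled, documented engineering activity rather than a script.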

This is in sharp contrast with enterprise cybersecurity programs, which in some domains apply constant, aggressive change to stay ahead of the adversary: the latest security updates as quickly as practical, the latest anti-virus signatures, and the latest software versions, keys and cryptosystems. These “constant change” practices fly in the face of the ECC discipline. There is simply no way to keep industrial equipment patched as aggressively as we patch enterprise networks. One consequence of this limitation is that most industrial equipment is vulnerable to known exploits for much longer periods than is typical of enterprise equipment.

Not All Systems Are Special

While many IT practitioners misunderstand ECC, many engineers misapply it. Some patches – for example, to remote access systems – may be very unlikely to impair safety, or even reliability. Remote access is a convenience at most sites, not an essential element of safe or reliable operations. Worse, remote access systems tend to be among the systems at a site most thoroughly exposed to external cyber attacks. These are the very systems that need to be patched most aggressively – the IT approach of constant, aggressive change is precisely what these systems need.

In short, a truism of OT security is that (a) most IT teams need to learn that many OT systems are special, and (b) most engineering teams need to learn that not all of their systems are special. Yes, we need ECC to manage our most consequential systems, but we need the IT discipline to manage our most exposed systems. And if we discover in our design that any of our most consequential systems are also among our most exposed systems, then we have a very bad design, and we urgently need to change it.
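As a minimal sketch of that design check (the system names and the high/low ratings below are invented for illustration), classify each system by consequence and by exposure, and flag anything that rates high on both:

```python
# Hypothetical sketch: flag systems that are both highly consequential and highly exposed.
# System names and ratings are invented for illustration only.

inventory = {
    # system:              (consequence, exposure)
    "safety PLCs":          ("high", "low"),
    "remote access server": ("low",  "high"),
    "historian DMZ relay":  ("low",  "high"),
    "turbine control DCS":  ("high", "high"),   # the "very bad design" case
}

for system, (consequence, exposure) in inventory.items():
    if consequence == "high" and exposure == "high":
        print(f"DESIGN PROBLEM: {system} is both consequential and exposed; change the design")
    elif consequence == "high":
        print(f"{system}: manage under engineering change control")
    elif exposure == "high":
        print(f"{system}: patch aggressively, IT-style")
    else:
        print(f"{system}: routine maintenance")
```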

To dig deeper, click here to request a copy of this author’s latest book, Engineering-Grade OT Security: A manager’s guide.

About the author
Andrew Ginter

Andrew Ginter is the most widely-read author in the industrial security space, with over 23,000 copies of his three books in print. He is a trusted advisor to the world's most secure industrial enterprises, and contributes regularly to industrial cybersecurity standards and guidance.