29 Jul 2019 We can only secure what we know we have – Rick Kaun
READ HERE THE FULL TRANSCRIPT:
Intro: The Industrial Security podcast with Andrew Ginter and Nate Nelson, sponsored by Waterfall Security Solutions.
Nate: Hi everybody and welcome back to The Industrial Security podcast, my name is Nate Nelson, I am here with Andrew Ginter, vice president of Industrial Security at Waterfall Security Solutions who’s going to introduce the subject and the guests of today’s show. Andrew, how are you?
Andrew: I’m well, thank you, Nate, it’s always a pleasure to join you. Our guest today is Rick Kaun, he is the vice president solutions at Verve Industrial Protection. He’s going to be talking about asset inventory as the foundation of industrial security and IT/OT convergence.
Nate: Let’s get on over to your conversation with Rick.
Andrew: Today’s topic is how asset inventory contributes to industrial security in a world where IT and control system (OT) networks are increasingly interconnected, and where IT teams are increasingly responsible for at least some of how OT systems work. That responsibility often includes security. Very broadly, we’re talking about IT/OT convergence. Everyone’s heard of the term. Gartner coined it in the mid-1990s, at least in my recollection, and back then it was very controversial. Today, it’s less so. Why are we still talking about this?
Rick: It’s funny that we say it was controversial, I think it somewhat still is. In the discussion we had prior to recording here, you and I were going over some common players in the space and well-intentioned, but the language draws a great deal. I was at a conference in Sacramento last year, Nesbitt’s Cyber Senate series, and Zach Tudor from INL said, “There shouldn’t be convergence, there should be a shadow IT, everybody do their own thing,” because there’s a real passion that OT has, like we talked, it’s about safety systems and what you can or can’t do. But the reality is that IT has been injected into the OT space. We’ve introduced all these plug-and-play IP addresses, we’ve got this explosion of technology with IoT, wireless sensors, SCADA, along with some people wanting to move to the cloud. We don’t have a choice, we can’t just take the benefit, we have to do the heavy lifting and work. So, to me IT/OT convergence is from an OT perspective (which is what I do), and we talked about it earlier, Andrew, we need to get to where the corporate appetite is for risk and what the best practice still in EADS, but we on the OT side may need to take a slightly different path or a different timeframe to get there. We all need to drive in the same direction, but we need to be OT specific in our application.
Andrew: So, Nate, it still sounds controversial. The trend in communications for the last 40 years has been greater connectivity. We all want greater connectivity for pretty much everything because we want the benefits. The trend for the last 20 years has been applying greater connectivity to IT/OT integration, and it sounds like the problem’s not solved.
Nate: Right. And of course greater connectivity comes with all these great things, but it also comes with all kinds of new attack vectors. And so, you’ve really got to put in a little bit more elbow grease when all of your systems are connected like this to properly handle the new security threats that you’re going to face.
Andrew: That’s right. And this was my next question to Rick: how do we make this work?
Andrew: In the world of IT/OT convergence, you’ve talked about some of the differences, the challenges. My own guess is that the trend has been going for 20 years and it’s not going to back off. How do we make it work? How does this succeed?
Rick: The majority of what I want to go through today are some case studies of how you would actually do it. And how you would actually do it is you need to be having discussions from a point of empirical knowledge; you have to have content. Too much of the discussion (to your point) for 20 years has been conjecture or opinion. I had a discussion just the other day with a fairly senior person at a large organization, and his job is to make sure that the pipeline is running the way it’s supposed to run. And so, he has pipeline integrity and he has leak detection and he has operations, but there’s a big push in that organization to take networking away because it looks too much like what IT does. But the reality is he can’t go without communications for more than, well, a few seconds. So, if I’m supposed to run the operating and spinning equipment and I want to make sure that I do my job and my boss wants to make sure I do my job, I can’t just write an SLA for 1 of those 3 main components that I absolutely have to have to safely operate. If I can’t control all 3 of those, then I’m not truly in charge of making sure they stay up. The challenge, though, is he gets into these discussions with the other side and they say, “Well, we’re doing it for this business unit,” or, “How bad can it be? Is it 300 systems? Is it 3000 systems?” The reality is that, once you actually have the discussion, it has to come down to, “Well, how many of these are we talking about?” If I need to apply a new group policy and I’ve got 10,000 assets that are potentially in scope, but it’s going to break something on 5 of those 10,000, I shouldn’t be as upset as if it’s going to break 9500 of those 10,000. And so, the context is a huge deal, and too many people are speaking from conjecture.
So, to me, what I would recommend is that we borrow a page from what is called ITSM, a very well defined term, IT Systems Management, and we’re pushing it as OTSM, or OT Systems Management. You need a programmatic approach, you absolutely have to have an eye towards what the program looks like. You cannot just buy technology, you cannot just buy a tool and cross it off the list; in OT, it won’t scale. We talked at the beginning: IT can standardize, everything’s going to be Windows 10, everything is going to be using SCCM, and everything can be centrally managed from a single domain controller. We know that there’s no way that’s going to happen in an OT space. So, for the convergence to work, the OT and the IT have to come together and start first from, “What is our objective? What does the end state look like?” Or, as we like to say, “What does ‘done’ look like?” If that’s CSC 20 with a specific maturity on inventory for automation and quality of content, so be it, but that whole context has to be understood before we run out and buy a tool.
One of the most important things, though, that I think needs to be considered, and what empowers that, is that inventory. Too many people don’t know what they have or don’t have. Too many people have great protections that then potentially curtail their ability to see the big picture. What I mean by that is we go to a large organization and they’ve got multiple departments with multiple tools each doing their own thing, and they often don’t necessarily even talk amongst themselves. So, having that comprehensive detailed inventory to start from, “What do we have? How many of them do we have? How are we protecting them? How well are we doing?” is an absolute foundational requirement to be able to build that plan or build that program going forward. And so, to your point, and the point of this whole topic, if we don’t have that empirical level set to start from, we can’t make intelligent decisions as a team.
Nate: Andrew, Rick talked about a lot of different stuff there, and I’m trying to connect it in my head, it seems to me like if there’s 1 theme to the last 3 minutes of what he was saying, it’s that effective communication between people, between teams is integral to good security.
Andrew: It is. He made 2 examples: he had an example early on about machine communications, but then he moved on to people communications. Let me give you the machine communication example first, just in terms of context. He was saying that when people talk to each other, they’re trying to figure out responsibilities, and he gave the example of a person in a pipeline organization responsible for the operation of the pipeline, where communications is essential to that operation. And he was arguing with the IT teams because the IT team said, “We manage all communications contracts for the whole company.” How does that work? Well, just to give you a concrete example, he said SLAs, Service Level Agreements, might not be the answer, you might need something deeper. I’ve done a fair bit of work with pipeline companies over the years. I remember one company talking to me about their communications. I remember 5 levels of backups in their communication system, it was just crazy. And the reason for all this is that, if you’re dealing with an oil pipeline or a gas pipeline, you’ve got flammable, explosive contents in the pipeline. And so, there’s a lot of regulations, there’s laws about how you’ve got to manage the contents of the pipeline. As I recall, in some jurisdictions, if you lose visibility into the operation of the pipeline, let’s say you lose communications, you can’t see what a section of your pipeline is doing anymore, well, now you’re doing what’s called flying blind; you cannot see what’s happening. In a lot of jurisdictions, by law, you’re able to fly the pipeline blind for exactly 30 seconds, and then by law, you’re required to shut it down. Within those 30 seconds, you madly try all of your communications backups to try and restore communications so you can see the pipeline again and keep it running. And if you can’t, by law, you’ve got to lift the cage on the big red mushroom button, whack it and shut everything down. And so, communications was important.
And I don’t remember the details, but it was something like a fiber running along the pipeline; that was their primary communications. If some backhoe somewhere cut the fiber, I don’t know, nowadays it’s probably something like cell phone Internet as a backup. And as a backup for that, they had plain old telephone lines with 5600 baud modems, and a backup to that was like satellite communications; and I missed one. As you got deeper and deeper into your backups, your connectivity got slower and slower, but you could still operate the pipeline. You might not be able to operate it as efficiently as you did before because you don’t have as much data in your hands, you can’t profit from that data, but you could operate the pipeline. And so, his point about pipeline communications dovetails into people communications, talking about responsibility: who’s responsible for what? How do you specify a service level agreement with 5 layers of redundant pipeline communications, each layer more costly than the other, each layer slower than the other? How do you do that? You’ve got to talk to each other. And his point at the end, we’re going to come back to asset inventory very quickly here. I would paraphrase this point as basically, we’re talking about industrial security here, you can only secure what you know you have. How can you secure an asset that you don’t know exists? And so, an inventory of the assets that you’re securing and their status is vital; this is talking about empirical data versus talking about people in a room with blank pieces of paper in front of them arguing over theory.
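To make the layered-backup idea concrete, here is a minimal Python sketch of walking a chain of backup links inside a fixed "flying blind" window. It is an illustration only: the link names and their order, the `try_link` callback and the function name are hypothetical, and the 30-second figure is simply the number quoted in the conversation, not a statement of any particular regulation.

```python
from typing import Callable, Optional, Sequence, Tuple

BLIND_LIMIT_SECONDS = 30.0  # figure quoted in the conversation; actual rules vary

# Hypothetical ordering, fastest link first. The conversation mentions five
# layers but names only these four.
LINKS = ["fiber", "cellular", "dial-up modem", "satellite"]


def restore_visibility(
    links: Sequence[str],
    try_link: Callable[[str], Tuple[bool, float]],
    limit: float = BLIND_LIMIT_SECONDS,
) -> Optional[str]:
    """Walk the backup links in priority order until one works.

    try_link(name) is a hypothetical callback that returns
    (succeeded, seconds_spent). The function returns the first link
    restored inside the limit, or None, which in this sketch means the
    operator has to shut the pipeline down.
    """
    elapsed = 0.0
    for name in links:
        succeeded, seconds_spent = try_link(name)
        elapsed += seconds_spent
        if elapsed > limit:
            return None  # out of time, even if this attempt succeeded
        if succeeded:
            return name
    return None  # every backup failed
```

Time is passed in as data rather than measured, so the sketch can be exercised without real links. It also shows why a single SLA number is hard to write: the outcome depends on the order of the layers and on how long each attempt takes.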
Nate: I’m not sure if this is an obvious question, but we’re talking about asset inventory here. What inventory? Are we talking about pumping stations? Are we talking about computers, software, switches? What are we talking about?
Andrew: That’s a very good question and later, most of the podcast here is going to be spent answering it, but I asked something like that question to Rick next. So, let’s go back to him and see what he has to say.
You gave an example of 10,000 systems: if 5 of them go down, it’s not a big deal; if 9500 go down, it’s a very big deal. But if 5 pumping stations go down, your pipeline stops, doesn’t it? Can you clarify here?
Rick: When I’m saying 5, I mean 5 particular IP addresses maybe, a particular engineering station or a particular file server or domain controller, not 5 full pumping stations; which is a great clarification to make. But it does underscore the fact that not all assets are created equal. I could go out and find 10,000 IP addresses, and I don’t need to give every single one of them full protection; others I’ll probably give passing protection. Or, as we discussed earlier, we have a client that’s considering taking all network based attack vectors for specific risks on systems, and if those systems sit in a lower level of the Purdue model, like level 2, they want to arbitrarily cut that risk score in half and make it that company’s own perspective of the risk. Because, yes, it’s a network based attack, but it’s not necessarily considered a critical asset as far as operation is concerned, it is redundant, it’s multiple layers away. Perhaps it’s behind a data diode. Network based attack vectors suddenly, while important, aren’t my biggest concern. And that’s what I meant by understanding, “Is it 5 of 10,000 that they need to work around? Or is it most of my systems that I’m wide open on, like the RDP risk that came out recently?” That whole context, not only of the specific risk but of that asset within the system of operations, is where we then start to make informed decisions.
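The client rule Rick describes, halving the score of network based risks on systems low in the Purdue model, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Finding` fields, the 0 to 10 score scale, and the choice to treat "level 2 and below" as the cutoff. It is not Verve's scoring method.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    asset: str
    purdue_level: int    # 0 (field devices) up to 5 (enterprise), assumed scale
    network_based: bool  # True if the attack vector is network based
    base_score: float    # assumed 0-10 severity score


def adjusted_score(
    finding: Finding,
    max_level_for_discount: int = 2,  # assumption: "a lower level, like level 2"
    factor: float = 0.5,              # "cut that risk score in half"
) -> float:
    """Discount network based findings on assets deep in the architecture."""
    if finding.network_based and finding.purdue_level <= max_level_for_discount:
        return finding.base_score * factor
    return finding.base_score
```

The discount is a statement of one company's risk appetite, so both the cutoff level and the factor are parameters rather than constants.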
Nate: Rick mentioned the Purdue model, what is the Purdue model?
Andrew: The Purdue model is, you can look it up, it’s a way to organize industrial control systems, to group assets so that they make sense. And it was initially invented as a way to group assets so you can talk intelligently about networks of things, the device network, the HMI network, instead of just computers all over the place. But nowadays, people think of it as a model for security as well. Because you drop a firewall or a unidirectional gateway or other kind of protection between each of these sub networks, and often they’ll arrange the model in layers of networks, now you’ve got many layers of networks you’ve got to get through to reach your target. I think this is what he’s talking about, the layers concept: the stuff deeper down is better protected.
Nate: And layered security sounds very satisfying to me, but on the other hand, isn’t it possible that if you layer security measure upon security measure, you might just be creating for yourself a false sense of security that maybe there’s working hard versus working smart, you could just throw a lot of defense systems on top of each other and it might actually be better to be more efficient, think specifically about what you need and what you don’t need and then implement your defenses that way?
Andrew: Yeah, very much so. I remember an example: I had a new guy start with me in one of my roles over the years and we sent him to CISSP training (I’ll see if I can remember it, certified something security professional; shows you what I know). But the acronym is still the most widely recognized security credential. We sent him to training for this and he came back having learned about everything. And I basically said, “So, I’ve got an industrial network here. Given what you just learned, apply the knowledge: how would you protect this? What would you deploy?” He says, “Well, I would take my list of defenses and deploy one of each.” And then I showed him, using the standard attack patterns, the standard attack tree, how you would breach his one-of-each architecture, and he looks at it and goes, “Oh, there’s more to it than that.” Yes, there is. You can’t just throw one of each, you need to evaluate the resulting security architecture against well understood attack patterns and make sure that you’re covered. You can’t just throw 6 layers of antivirus in there and say you’re done. So, on the one hand, you’re right; on the other hand, it is possible to arrange layers of security so that each layer really is robust and the next layer of network down is that much harder to get into, to the point where the deepest networks are supposed to be practically impossible to reach without insider assistance, deliberate insider assistance at your target. And now we’re not talking a cyber-assault anymore, we’re talking a physical assault.
So, this is sort of the ideal that a lot of people hold up, but back to the point here, back to Rick’s point, yes, if you have a robust posture, layers of security like the Purdue model, and you’ve done it right, you can legitimately discount the severity of certain kinds of vulnerabilities deep in the architecture, because that equipment is so very difficult to get to in order to exploit those vulnerabilities. So, his point is that all of that’s relevant to the risk equation. And, again, he’s going to come back very quickly to, “We need the data in order to carry out those calculations,” and that data is basically different kinds of inventory data.
Nate: So, if I can just summarize, layered security in and of itself is not sufficient, but layered security done thoughtfully is.
Andrew: So, Verve has a suite of technology that helps with asset inventory. Can we talk about what you have for a minute, please?
Rick: We found that the number one thing that you can do in this space is to actually connect to the endpoints and properly profile them. Everybody, even all the passive tools, are getting lots of opportunities because of the inventory claims, but in reality, they were built for something completely different. And so, we have no problem with the passive anomaly detection tool, but we’d rather it was used for detection as opposed to inventory and profiling. And so, we think of it as an ‘and’, not an ‘or’. So, to be clear, yes, we have a technology we think is superior in inventory capability, but remember, part of this discussion is also about systems management. The agent, as opposed to a passive listening tool, allows you to make change on the endpoint as well. It’s not enough to just say, “Hey,” like the plant manager at an air separation unit once said to me, “Oh, you’re going to do an assessment on my OT space, you’re going to tell me that I don’t patch and I don’t change passwords, what good is that?” And the reality is, he’s got a point. Unless we can actually do something about it, we’re not making much of an improvement.
Andrew: I just wanted to interject briefly here, Rick talked about anomaly detection tools. I just wanted to point out to anyone not familiar with the technology, what he’s saying there is that if you’ve got technology, an anomaly based intrusion detection tool that is listening to the network and inventorying all of the communications, all of the active connections, you can produce a list of IP addresses, you can gain some information about what kind of machines they are. “This one seems to be communicating the way a Windows machine would, that one’s communicating the way a switch would.” But it’s much harder to deduce things like, “Yes, and 6 of 23 recommended patches have been installed on this Windows machine, and the firmware on that switch is 2 years out of date,” it’s very hard to draw those much more detailed conclusions. So, his point here is people sometimes are tempted to use detection tools, because they will produce a list of IP addresses, but the depth of information that you have there is very limited.
And he’s been talking at a very high level here, so I asked him to go deeper and really talk about what his stuff does in more detail.
Can we dig a little deeper, please? I just heard you talk about what your technology does. Dig a little deeper for me, what is the technology? How does it work?
Rick: The technology itself, we talked about the biggest challenge being having an inventory or an empirical start point, but that’s just the first step. Once you understand what you need, the biggest challenge in OT is maintenance and management. So, when we talk about ITSM or systems management, the emphasis is on management: it’s fine to have intel, but you have to be able to act upon it. So, that’s why we built what we did. We built what we call an OT systems management technology. Now, there’s a basic setup component, and the case studies will walk through different adaptations and uses of the technology, but the setup component is a comprehensive real-time automated inventory. We’ll talk a little bit later about the coverage, but in essence, anything with an IP address in the space needs to be connected to. If it’s an OS based device, we put an agent on it, and we make no apologies for it. The main OEM vendors put all sorts of agents on their systems, whether it’s antivirus or backup tools or what-have-you, and so these are safe and supported, and it is embracing IT in the OT space. So, we put agents on anything running an operating system, and we use a number of connection methods to get to networking and switch gear as well as embedded stuff. We talk to controllers. We can talk to the Rockwell AssetCentre, but we’d rather talk to the PLCs because they are the source of authority.
So, we build this automated real-time inventory into a centralized database, and we normalize all the data and we organize everything in that database around the asset. We then allow for tribal knowledge or metadata to come in so that the operators can say, “These are critical,” or, “These are redundant,” or, “We could live for 3 days without them, we don’t care.” We also allow them to put in where the system is located so that if they need to troubleshoot or send somebody, they know where it is. Once we have that asset record with empirical or tribal knowledge or metadata, we then add third party data: “What’s the antivirus status? What’s the backup status? What’s the whitelisting status?” This helps to build what we call a 360° view, or I call it the hockey card (I’m working for a US company, I don’t need to say baseball cards, they understand what I’m talking about), but it’s a profile of that asset. And then on that view, you can take things like the National Vulnerability Database and layer it over top. And now we’ve gone from something we didn’t even know we had very well in terms of an inventory, and started to apply practical context to it.
Now, the very next step that the platform and our technology and even our services, both before and after, help with is we can now take action, and it’s measured and it’s directed and it’s prioritized. Because I can take, for example, the National Vulnerability Database against an asset list of thousands of assets and come up with 10,000 risks. But if I can boil it down to those that are considered critical assets with critical risks that don’t have a recent backup and where whitelisting isn’t locked down, I know where I need to start. Now, the last loop, though, when asking, “Who consumes this data?” Well, the reporting stuff can go to compliance, it can go to regulatory, it can go to the CSO or CIO, which gives them huge insight that they don’t necessarily have today. But you can also (within the technology) use the agents to make change. We can actually push a patch, we can tune an endpoint, we can turn on or off a port or a service. And so, the systems management, the management part, is what really comes together in this multi-discipline platform that we built.
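The "boil it down" step Rick describes can be pictured as a filter over asset profiles. The Python sketch below is a hypothetical illustration under stated assumptions: the `AssetProfile` fields, the severity labels, and the 7-day threshold for a "recent" backup are all invented for the example and do not describe the actual product.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class AssetProfile:
    name: str
    location: str                # operator-supplied ("tribal") knowledge
    critical: bool               # operator-supplied criticality flag
    days_since_backup: int       # third-party backup status
    whitelisting_locked: bool    # third-party whitelisting status
    risk_severities: List[str] = field(default_factory=list)


def start_here(
    assets: Iterable[AssetProfile],
    max_backup_age_days: int = 7,  # assumption: what counts as "recent"
) -> List[str]:
    """Return critical assets with critical risks, a stale backup,
    and whitelisting that is not locked down."""
    return [
        a.name
        for a in assets
        if a.critical
        and "critical" in a.risk_severities
        and a.days_since_backup > max_backup_age_days
        and not a.whitelisting_locked
    ]
```

Each condition draws on a different data source (operator metadata, vulnerability data, backup and whitelisting status), which is why the filter only works once those sources have been merged into one record per asset.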
Nate: Andrew, this is starting to get a little bit dense.
Andrew: Let me summarize. In the terminology of the industry, the agents are software; software that’s installed on machines that will tolerate software being installed on them. He also talked about connecting to and querying devices like PLCs or network switches where it’s not possible to install the software directly. In the parlance of the industry, this is called an active solution; there’s stuff in there in the control network actively doing things. And he also mentioned the National Vulnerability Database, which catalogs vulnerabilities by their CVE identifiers. The point there is, if we have software on a machine, let’s say a Windows box, it can report, “Look, here’s the version of Windows, here’s all of the software updates that are installed.” If we’re connected to a PLC, we can ask it its version of firmware, and we can do the same thing to a switch. Now, he said, we can layer the vulnerability data on top of that. If we’ve got the vulnerability information in hand as well, what we can do is compare one to the other and say, “This switch here is running firmware that’s 2 years old and the CVE database has identified 3 vulnerabilities in that version of firmware. That version of Linux over there, same thing, it’s a year and a half old, we’re missing fixes for 8 vulnerabilities.” So, the agents are gathering a much deeper knowledge of the assets that are being inventoried than you would get by just looking at network traffic (this was back to his original comment on intrusion detection tools) and we can couple that with the vulnerability information.
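Comparing the inventory to vulnerability data comes down to a lookup keyed on product and version. The following Python sketch shows that join under simplifying assumptions: the in-memory dictionaries stand in for a real inventory database and a real vulnerability feed, exact version matching replaces the version-range logic a real feed needs, and the identifiers are placeholders rather than real CVE numbers.

```python
from typing import Dict, List, Tuple

# asset name -> (product, version), as reported by an agent or device query
Inventory = Dict[str, Tuple[str, str]]
# (product, version) -> list of vulnerability identifiers (placeholders here)
VulnDb = Dict[Tuple[str, str], List[str]]


def match_vulnerabilities(inventory: Inventory, vuln_db: VulnDb) -> Dict[str, List[str]]:
    """Return, per asset, the known vulnerabilities for its exact version.

    Assets whose product and version have no entry are omitted.
    """
    return {
        asset: vuln_db[key]
        for asset, key in inventory.items()
        if key in vuln_db
    }
```

This is also why a passive list of IP addresses falls short for this purpose: without the product and version for each asset, there is no key to look up.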
Nate: Another thing Rick mentioned was tribal knowledge. Can you talk about why that’s important?
Andrew: Yeah. So, that’s not a technical thing, that’s sort of what’s walking around inside people’s heads. Tribal knowledge is knowledge that’s not been written down. It’s great as long as people are around, but if people retire, if people are promoted into other divisions, if they leave because they join another company, you lose the knowledge in their heads. And the knowledge can be very important to troubleshooting and even planning security. For instance, tribal knowledge might be, “This particular machine, this IP address, where is it? It’s in that room over there, it’s in the third cabinet, it’s the 8th machine up in the rack. That machine, that IP address over there, that’s in a substation, the other one’s in a pumping station 300 miles away,” knowing things like where they are, knowing things like why they exist. It’s one thing to say, “I’ve got a Windows box with 7 kinds of software on it with these versions,” it’s another thing to say, “And we put it there to do this.” So, that kind of knowledge that’s in people’s heads (he called it metadata) is important to capture into the database as well, and makes the information in the database about software and versions much more valuable.
Nate: Right. And before we move on, Rick talked about agents actually changing stuff, tuning things, changing settings, is that a good idea?
Andrew: Yeah, all change has its risks, and that’s a topic that Rick came back to much later in the podcast. So, let’s go back to Rick and we’ll catch up with that topic a little bit later.
Can you give us some examples, some guidance, how do we put this into practice?
Rick: I said inventory is important, I said negotiation is important in dealing from an empirical starting point, and that you want to look at the overall program as an end goal. So, for that, I want to share 3 different case studies. And we can break them up and go into them as we go here, I’m not going to hammer through all of them without a pause, but these are 3 that I think are different applications of how you would take an IT mindset or a systems management perspective and a programmatic approach and use technology in an OT space to drive significant change or significant benefit and value. Now, all 3 case studies are predicated on the use of an agent based approach, a comprehensive asset inventory that covers what we call north-south and east-west coverage. North-south means, “I’m not just going to look at networking gear, I’m not just going to passively listen and hope to pick up some IP addresses, I’m actually going to go to all networking communications gear, all endpoints that run an OS with an agent, all embedded stuff, relays, PLCs, controllers. I’m going to go all the way down as far as I can go, anything that has an IP address,” and in many cases, also types of technology that have, say, a gateway, like a Schweitzer gateway, with serial connected stuff in behind as well, because those are equally important to the safe operation. So, north-south coverage and then east-west. I like to joke that we equally piss off all OEM vendors. A lot of them don’t like other people coming in on their system, a lot of them don’t like us potentially taking away business, but we have found more and more that many of the OEM vendors are just not able to keep up with the speed and development of security tools. We joke that (well, not even joke, it’s reality) Honeywell doesn’t want to certify our stuff because it would cost us both a million dollars’ worth of testing, and every time either one of us made a change, we’d need to do it again.
So, what we’re advocating though is we’ve been doing this for 10 years, we’ve done it safely, we’ve never tripped a plant, we’ve never voided a warranty. And when you look at what the vendors are putting on their devices, there’s lots of agents, there’s more and more tools coming, it is the future. We’ve seen ABB say, “You know what? We can’t stop it, let’s work together.”
Andrew: So, the thing that struck me about Rick’s answer there is the value of automation. It’s possible to do this kind of inventory manually: go and look at the machines, figure out their IP addresses, log into them, figure out what software is installed and so on. And some people do the inventory manually. It takes some time, it takes some people, you produce a report, and the thing that struck me was a day later the report’s out of date because things change constantly. Now, they don’t change dramatically constantly, they change slowly, but more or less constantly; odds are, a day later, a week later, the report’s inaccurate. With the automated system, one of the benefits of any automated inventory system is you press a button and you get a new report, and you can see what’s happening day by day; it doesn’t go obsolete immediately. So, that struck me in his answer.
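One way to see the day-by-day benefit is to compare two inventory snapshots and report what drifted. The Python sketch below assumes each snapshot is a simple mapping from asset name to a dictionary of attributes; that shape, and the function name, are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List

Snapshot = Dict[str, Dict[str, Any]]  # asset name -> attributes (assumed shape)


def diff_inventory(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """List assets that appeared, disappeared, or changed between snapshots."""
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }
```

A manual report is a single snapshot; an automated system can produce a new one on demand, which makes this kind of comparison routine instead of a project.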
Nate: And he’d also talked about 1 million dollars in testing, what was that about?
Andrew: Well, this is a tricky bit. Basically the vendors have a lot of say as to what software runs on their equipment. So, for example, if we’re talking a power plant, if we’re talking a refinery, the vendors of the PLCs, the vendors of the control system software, will often have a person on-site or even 2 people on-site at a big site, and it’s those 1 or 2 people’s job to keep the vendor’s stuff running correctly because they are the experts, they are the vendor. What those people don’t like is when somebody else, the site, the owner and the operator, comes by and says, “I’m dropping some third-party software on your machines.” They say, “No, no, no, that’s not our stuff anymore, we manage our stuff, we don’t manage other people’s stuff,” and they push back. And if the vendor wanted to certify somebody else’s stuff, if you wanted to take all of the Verve agents, the example he gave, and certify that they run correctly on all of Honeywell’s equipment, well, Honeywell has a lot of kinds of equipment, and they would have to run a full regression test on every one of their products with the Verve stuff installed. This is the million dollars; it costs a lot of money to do this certification. So, that’s why he’s getting pushback from some of the vendors and from some of the owners and operators saying, “I don’t know about this agent stuff.” So, I asked him next about the agents: what’s going on with these agents? Does it really work? Do people actually let you install this stuff?
So, you’ve talked a lot about agents, and I know from personal experience years ago, I was working for an organization that was trying to install third-party agents on a lot of OT equipment, there was such pushback. Is that still the case? What do you get in terms of pushback on agents on OT equipment?
Rick: Yeah, it’s actually my favorite discussion, I say that somewhat tongue-in-cheek. I recall a conversation actually with a superintendent of a very large power company once, and she was asking me questions about, “What do you do and how do you do it?” I said, “Well, we put the agent out and it allows for an automated inventory, it allows for that insight, it allows for a lot of flexibility,” and she said, “Well, but we don’t allow agents on our devices.” And I said, “Okay, well, what are you using today?” and, well, they’re an Emerson shop so they had the Emerson Ovation Security Center at that point in time, which used Lumension, which is an agent, and had Acronis, which had an agent, and it had an antivirus, and I said, “Well, those are 3 agents you have in your systems,”
“Oh yes, well, we don’t allow additional agents,”
“Okay.” We continued to talk and then they said that they were experimenting with putting a whitelisting tool like Carbon Black out, and I said, “But that includes an agent,” and again, she’s like, “Well, yes, it’s an agent, but we don’t want to put any more agents out there.”
And so, it was this really weird sort of catch-22, no matter how hard I pointed out that for 10 years, we’ve been putting agents out; in fact, the systems integration half of our company was putting Acronis on Ovation before Ovation did. It’s a matter of understanding the safe environment and what you can or can’t get away with and how to treat those systems with respect more so than a hard and fast rule of just an agent. In fact, we had one organization say, “Well, the fact that you’re going to talk to my endpoints, even if it’s in their own SNMP language or with their own credentials or even with the third-party vendor software, you’re not doing it in a normal operation, that’s considered invasive and it’s not allowed.” So, there is still, to your point, a lot of pushback on embracing technology, which is kind of the frame and the concept of the whole discussion we’ve had: don’t fear the tech, IT and OT can and have to work together. And if we can find a way to build empirical data that we can together look at and make informed decisions, but even better than that, actually take effect. One of the things we talked about prior to this conversation, you and I, Andrew, was other industry verticals and things that are happening or not, and a lot of people kicking tires, but nobody jumping in. And I fear that it takes a lot more convincing and educating than it maybe needs to, but I guess it means I’ll still have a job for a long period of time.
Nate: Being somebody myself who’s afraid of change, I can empathize with the people that Rick’s annoyed with.
Andrew: Absolutely. And people are afraid of change for many reasons. In the industrial world, it’s not so much fear as risk. Every change is a threat, every change, there’s a chance that you fumble-finger something and you’ve got a safety incident, people get hurt or you have a reliability incident, part of the plant trips and shuts off. And so, change is risk, this is why we have the engineering change control discipline, this is the engineering discipline of managing the risks associated with change. You do your analysis upfront, you plan for the change, you execute the change, you evaluate, “Did that work?” you learn from your mistakes and repeat. But engineering change control is not the same as no change. What I see happening here is Rick talked about the cost of certification, and back when I was doing this, a lot of people were talking about certification and it was potentially a very expensive process. What people seem to be moving toward now, he didn’t use the word, but I heard him talking about a reputational component to engineering change control. Instead of saying, “You must be certified everywhere in spite of how ridiculously costly that is for everyone involved,” people are starting to use the reputations of vendors, the reputations of pieces of software, not just Verve’s agents, but other agents that are being installed on industrial systems because those agents have value, that’s why they’re being installed. And the experience of other sites with those agents is being taken into account in the process of the change control calculation. So, the bottom line is that change is happening, there’s too much good stuff out there. What I heard from Rick is that the engineering change control is helping to manage the risk, it’s not forbidding change and it’s not forbidding new agents. So, that’s an interesting development in the decades since I was dealing with agents myself.
You said you had another example.
Rick: A detailed asset inventory and the empirical data I’m talking about: a large FDA-regulated manufacturing company recently took the agent-based approach with agentless profiling of switches and network (unclear) [38:22] multiple technologies to connect. And Monday morning, they installed the technology, they discovered what assets were on the subnet, they deployed the agents and the agentless components and they tied it all back to a central reporting component. And the detailed analysis that they took from that asset list, they then put in front of the operators. So, if you recall, a moment ago we talked about putting context, what I call metadata or tribal knowledge: once I have an asset list, I put it in front of an operator and tell him, “You list for me which is critical, which is redundant. From a physical or maintenance perspective, let’s put which room it’s in, which rack it’s in, which unit it’s on, which floor it is, and really sort of start to build that.” We then take in the third-party data: antivirus data, backup, whitelisting. And in many cases in immature organizations, a lot of that doesn’t even exist, but the idea was that we built a record all around the asset and everything the asset had or did.
Now, in this first case, the client deployed this technology, by noon on Monday, they had all the IP addresses figured out, by the end of day Monday, they deployed the agents to all the endpoints, they tuned all this data and they spent Tuesday and Wednesday going through the data that they were able to uncover, and Thursday at noon, they presented to the board. What they had discovered in those 72 hours was that, of the 485 IP addresses, 7 machines were dual-homed. 5 of those dual-homed machines were running things like TeamViewer with admin rights. They found 179 instances of devices not patched for NotPetya and WannaCry, they found a half dozen PLCs that had known exploits from the National Vulnerability Database. Now, the benefit to this client: they embraced what many would consider an invasive or heavier touch in the OT space to pull all this data back, and now they had empirical evidence for budget justification, for order of priority, for order of magnitude, right? And by seeing in the software list, for example, how many had antivirus and of what version and what age, you instantly are able to plan your mitigation and your remediation. So, as they go through this process, that same reporting is still available and they will see, as they make changes, improvements to those scores and all those big red dials start to drop down. But it’s all about empirical analysis of actual data from the endpoints to say, “Here’s what we need to do in which order and here’s how mature we are or rather how far away we are from where we want to be.”
Andrew: So, Rick used a lot of technical terminology when he talked about his inventory and how much bad stuff it discovered. Let me define a few terms really quickly. Dual-homed hosts: this is a computer that’s connected to 2 networks. A firewall is a dual-homed host, it’s a computer, it’s connected to let’s say the IT network and the industrial network, and it’s designed to be a security device, it’s designed to be secure to enforce security policies. If you’ve got another host beside the firewall that’s also connected to the IT network on one side and connected to the control system network on the other side, that’s a problem, that’s called a dual-homed host, and the host generally is not designed as a security device and generally is less well protected than a firewall. And so, having 5 of these, basically they’re saying we have 5 connections between the IT world and the control system world or between 2 pairs of networks that are not firewalls, that are not managed as security connections. He also mentioned TeamViewer. TeamViewer is a way to do remote access, it’s like remote desktop. If TeamViewer is running as the administrator, anyone who guesses the password and logs in is running as the administrator and can do anything they want to your machine. So, that’s a bit of badness. And the NotPetya and WannaCry: WannaCry was ransomware, you want the security updates in place so WannaCry can’t hit you. NotPetya was worse than ransomware, it erased your machines; these are topics worthy of a podcast all on their own. But, yeah, the vulnerabilities that those nasties exploited are vulnerabilities that most control systems would put a high priority on fixing. So, this is sort of the badness that was discovered.
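To make the dual-homed check concrete, here is a minimal Python sketch of how an inventory could be scanned for it. It is not Verve's implementation; the `Asset` record, the zone address ranges, and the host names are all invented for illustration. The idea is simply that a non-firewall host with interfaces in more than one network zone gets flagged, and flagged hosts running a remote access tool with admin rights get flagged again.

```python
import ipaddress
from dataclasses import dataclass, field

# Hypothetical network zones; these address ranges are invented for illustration.
ZONES = {
    "IT": ipaddress.ip_network("10.1.0.0/16"),
    "OT": ipaddress.ip_network("192.168.10.0/24"),
}


@dataclass
class Asset:
    """Hypothetical inventory record, not a real product schema."""
    name: str
    addresses: list
    is_firewall: bool = False
    # Maps program name -> True if it runs with admin rights.
    software: dict = field(default_factory=dict)


def zones_of(asset):
    """Return the set of zone names in which the asset has an interface."""
    found = set()
    for addr in asset.addresses:
        ip = ipaddress.ip_address(addr)
        for zone, network in ZONES.items():
            if ip in network:
                found.add(zone)
    return found


def dual_homed(assets):
    """Non-firewall assets with interfaces in more than one zone."""
    return [a for a in assets if not a.is_firewall and len(zones_of(a)) > 1]


def admin_remote_access(assets, tools=("TeamViewer",)):
    """Assets running any of the named remote access tools with admin rights."""
    return [a for a in assets if any(a.software.get(t) for t in tools)]


inventory = [
    Asset("fw-01", ["10.1.0.1", "192.168.10.1"], is_firewall=True),
    Asset("eng-ws-07", ["10.1.4.20", "192.168.10.44"], software={"TeamViewer": True}),
    Asset("hmi-02", ["192.168.10.12"], software={"TeamViewer": False}),
    Asset("hist-01", ["10.1.9.5", "192.168.10.30"]),
]

risky = dual_homed(inventory)
print([a.name for a in risky])
print([a.name for a in admin_remote_access(risky)])
```

The firewall is excluded on purpose: it is also connected to both networks, but it is designed and managed as a security device.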
Nate: And also of note was how Rick mentioned budget. Budget’s something that we talked about a few times with other guests on the show, “How do you translate the priorities of IT/OT teams with the executives who have to actually make these decisions?” I found Rick’s perspective straightforward.
Andrew: Everybody wants to know how to spring open up budget for security. I caught the same thing when I was interviewing Rick, that was my next question to him.
A lot of people want to know, “How do I spring free some budget for the OT end of IT/OT integration of IT/OT security,” can you talk about that?
Rick: Yeah, we often get lots of questions around that, and sometimes, the question is, “How do I get the CSO to give me money in the first place?” And in the case of the case study we just talked about where I’ve got a list of assets, and so if you recall, we get all the assets, we get all the risks, and then I’ve got the metadata to say here’s what the issue is, once you see how many devices are dual-homed or mistakenly homed and how many assets you have in general, you have a pretty good parameter to start to say, “I’m pretty sure a network segmentation project would look like this.” We have many clients that use data diodes, understanding what their data flow is for historians and how many they have and where they have them. I can instantly enumerate with this data how many systems are running PI or OSI and therefore help to design a data diode upgrade or a network segmentation upgrade, so there’s some real empirical evidence in there. But the other part that many, many people forget about is I can take that data and I can show how many of my so-called critical assets that are a key component have thousands of risks and aren’t properly segmented. That usually gets a CSO’s attention and usually justifies a budget. There’s another part though, once I put this stuff in and I’ve got it segmented and as new patches come up and as backups fail or systems go from primary to backup, that’s all shown in the database as well. So, we can instantly see once we’re looking at this reporting concept, see just how quickly we fall behind on patching, how many times we need to touch things for troubleshooting backups, or if we see changes in configurations. We’re able to get all that data, and from a managed services perspective, we can start to scale economies of skill sets from central locations or we can tune our third-party managed services to come in at whatever intervals at appropriate levels, and we can instantly drive into that sort of intelligence.
So, there’s not only that upfront justification to go and do stuff because we’re this misaligned, there’s ongoing evidence of how well we’re keeping up and how much more time or attention we may need to spend.
Andrew: So, you talked about monitoring, you talked about agents helping with discovery, but you mentioned management. Can you give us an example of how that would work, where it’d be useful?
Rick: Let me walk you through a case study of an example, I’ll give you 2: one that just happened a while ago, and one that we just blogged about if anyone’s interested to go check it out. The first one was an organization that had our technology across 6 different generation facilities, and the reporting console came up into a single headquarters and an SOC team, and that single headquarters was able to view all 6 sites and all their current data. They got a call on a Saturday afternoon (true story), the CSO phoned and said, “I asked you guys to remove all versions of Kaspersky antivirus,” they didn’t want, as a US company, a Russian-based antivirus technology in their operational facilities. And so, they said, “Yeah, well we’re working on it with unions and change management and blah-blah-blah, it’s working,” he says, “Well, it needs to be done, I need to report on Monday that it’s going to be finished.” So, they said, “Okay, I guess we’ll do this.” So, what they did was they went to the console and they identified at each of the 6 sites which systems were running Kaspersky because you can drill into the dashboard and see what software is enumerated on the endpoints. They identified 147 instances of Kaspersky at 6 different locations. They then used the agent, and this is where it really becomes cool on the management side, and we talked about budget a moment ago, but this is where the ROI really comes out. They then sent a command via the agent to all 147 instances to uninstall the Kaspersky program, but they set a flag that said, “Make this an offer.” Now, what that meant was once they were done sending the command, all the endpoints were ready to uninstall it, but we then took that same list per site and we sent it to a local representative at each site, and that site had the list of assets, and if you remember, we allow them to put which room, which unit, which rack, so they were able to walk directly to each of the assets.
And because the assets had the command queued, but the flag said “Make this an offer”, the tech had to log into the system, accept the uninstall command, and oversee it being done. So, you have that last mile OT oversight with IT technology. Now, in a typical organization that had 6 generation facilities dating back to the 70s and 80s for vintage and number of systems and complexity and what-have-you, to get a call on a Saturday afternoon and try to identify and uninstall 147 instances at 6 different locations in 4 different states would take probably days, weeks, months. We did this start to finish and sent the report to the CSO in 90 minutes.
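The "make this an offer" behaviour Rick describes can be pictured as a queued command that only runs once someone at the endpoint accepts it. The Python sketch below is an assumption about how such a flag might behave, not Verve's actual design; `Command`, `dispatch`, and `respond` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OFFERED = "offered"    # queued on the endpoint, waiting for a local decision
    DONE = "done"
    REJECTED = "rejected"


@dataclass
class Command:
    """Hypothetical command record sent from the central console to an agent."""
    asset: str
    action: str
    offer: bool = True
    status: Status = Status.OFFERED


def dispatch(cmd, execute):
    """Central console sends the command; it runs at once only if it is not an offer."""
    if not cmd.offer:
        execute(cmd)
        cmd.status = Status.DONE
    return cmd


def respond(cmd, accepted, execute):
    """The local technician logs in and accepts or rejects the queued offer."""
    if cmd.status is not Status.OFFERED:
        raise ValueError("command is not waiting for a decision")
    if accepted:
        execute(cmd)
        cmd.status = Status.DONE
    else:
        cmd.status = Status.REJECTED
    return cmd


log = []


def run(cmd):
    """Stand-in for the real uninstall; it only records what would be done."""
    log.append((cmd.asset, cmd.action))


cmd = dispatch(Command("hmi-02", "uninstall antivirus"), run)
print(cmd.status, log)   # still offered, nothing has run yet
respond(cmd, True, run)
print(cmd.status, log)   # done after local acceptance
```

The point of the split is that the central team decides what should happen, while the person standing at the machine decides when.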
Andrew: What I took away from that was this make an offer feature. I’ve never heard of that before, but it seems like such a good idea. He didn’t say, “I pressed a button and uninstalled 147 copies of Kaspersky and 6 of the machines blue screened, oops,” that’s classic IT/OT mistakes, “Oh, I scheduled a backup and it rebooted 100 machines and 6 of my power plants dropped,” that’s classic mistakes on the part of IT people applying IT policies to OT equipment. What he said here was, “I’ve got this tricky thing I’m going to do, send a message to the machine and send a message to the people responsible for the machine.” Now the people have to walk up to the machine, they log in and they see a screen and it says, “Hey, I would like to uninstall Kaspersky for you. Click OK if I’m allowed to do this, click Reject if you don’t want this done.” And now what the people can do, they’ve got a list of machines at their site they have to do this to, they go to their least important machines, the screen comes up and they say, “Yes, do it now,” and they watch. And if the machine blue screens, well, you’re not going to press OK on the other machines on your site, are you? But if it works, you’re going to build up confidence with the procedure on a bunch of your machines. Now, you’re going to go to the critical machines, you’re going to go first to the critical machines that have hot standbys. You’re going to go to the standby machine, press the button, if it blue screens, the primary is still there, the power plant has not dropped.
If it works you can tell the primary, “Failover to the secondary and run there for an hour.” If it still works, you might come back to the primary and say, “Okay, do it here too.” Now, you go on to your next critical machine that has a redundant, and very last, you do the equipment that’s critical and has no redundant, but now you’ve built up a lot of confidence on this procedure on the equipment at your site that you as the technician are very familiar with. This process of, don’t just do it, but make the locals do it, but hold their hand through the process, that seemed to be brilliant. I’ve never heard this before, I thought, “What a good fit.”
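The ordering Andrew walks through (least important machines first, then standbys, then primaries, then critical machines with no redundancy, stopping as soon as something goes wrong) can be sketched in Python as follows. The `Machine` fields and stage numbers are hypothetical, and the sketch simplifies his description by doing all standbys before any primary rather than working pair by pair with a failover in between.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical record built from the operator-supplied metadata."""
    name: str
    critical: bool
    role: str = "standalone"   # "primary", "standby" or "standalone"


def rollout_stage(m):
    """Lower stage numbers are changed first."""
    if not m.critical:
        return 0               # least important machines build confidence
    if m.role == "standby":
        return 1               # the primary is still running if this one fails
    if m.role == "primary":
        return 2               # its standby has already been proven
    return 3                   # critical with no redundancy goes very last


def rollout_order(machines):
    # sorted() is stable, so machines within a stage keep their listed order.
    return sorted(machines, key=rollout_stage)


def run_rollout(machines, apply_change):
    """Apply the change in order and stop at the first failure."""
    done = []
    for m in rollout_order(machines):
        if not apply_change(m):
            return done, m
        done.append(m)
    return done, None


site = [
    Machine("turbine-ctl", critical=True),
    Machine("dcs-a", critical=True, role="primary"),
    Machine("report-pc", critical=False),
    Machine("dcs-b", critical=True, role="standby"),
]

print([m.name for m in rollout_order(site)])

# Simulate a technician who sees the change fail on the primary.
done, failed = run_rollout(site, lambda m: m.name != "dcs-a")
print([m.name for m in done], failed.name)
```

In the simulated run the unprotected turbine controller is never touched, because the rollout stopped at the primary.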
Nate: I’m with you, Andrew, it does sound like a good idea. But if I’m just going to play devil’s advocate for a second, you mentioned the problem with just clicking a button and, whoosh, 147 instances of the antivirus go away. It does sound efficient, errors aside, being able to just click a button and get the job done as opposed to having boots on the ground in half a dozen different facilities, all the people; it seems to me that that would incur more costs, not just in dollars, but in time and energy.
Andrew: Absolutely. It is my understanding it’s an optional feature of the Verve software, you don’t have to use it for changes that are safer than that, you can just press the button and make it take effect right away. But it’s a judgment call which change you do one way and which change you do another way, again, it’s all part of the engineering change control procedure. And I believe Rick had 2 examples, his next example is talking about one of those. So, let’s go back and listen in and see what he has to say.
Rick: The second case study I was going to offer, in the blog that we just wrote recently on our website under the resources tab, was about how you would remove or protect yourself from the remote desktop exploit that came out recently. You can conceivably (not everybody likes it from an operational perspective), but you can take the database and look at all your assets and enumerate every single one that’s running remote desktop. You could send a command to disable that service on every single one of those devices in conjunction with sending an email to operations saying, “Hey, everyone’s on site today, no remote access, whatever. Phone in if you need help, we’ll do instances to help if you need it.” But in the meantime, you’ve now removed that port or service and therefore the attack vector. You can then go back into the database and say, “Give me all my different types of systems by criticality, by OS, by location, by ownership, by function,” right, and you can actually prioritize your testing and your deployment, and that same agent can also push those patches to those thousands of assets in the field.
Andrew: So, that’s an example of where it makes sense to press a button centrally and carry out a change on a lot of systems at once. The central site still needs to be a little bit careful. Remote desktop is generally safe to turn off. Who uses remote desktop? It’s maintenance people when they’re logging in remotely because they don’t want to drive into the site, it’s a convenience, it’s not vital for the second-by-second operation of the site. The one exception to that rule is, if let’s say an operator at one site is using remote desktop because they are responsible for remotely operating another site, let’s say off-hours because the other site has a 5 by 8 operator, but off-hours, the responsibility devolves to another site, that’s where it’s important not to turn off that remote desktop. But that information will be in the metadata and so it’s something that the central site can go through and scan, “Are these systems safe to turn off remote desktop?”
“Yes,” bang, turn it off, and the vulnerability is gone; it’s not gone, but it’s inaccessible. So, that’s an example of sort of instant action. We’re coming up on the end here, I went back to Rick and started asking him about the big picture again; let’s listen in.
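The scan Andrew describes, where the central site checks the metadata before switching remote desktop off everywhere, might look like the Python sketch below. The asset dictionaries and the `remote_operations` flag are invented for illustration; a real inventory would carry this metadata in whatever form the operators recorded it.

```python
def rdp_disable_targets(assets):
    """Split assets running remote desktop into those that are safe to
    change centrally and those held back for remote operations."""
    targets, held_back = [], []
    for asset in assets:
        if "remote_desktop" not in asset["services"]:
            continue                       # nothing to disable on this asset
        if asset.get("remote_operations"):
            held_back.append(asset["name"])   # used to operate another site
        else:
            targets.append(asset["name"])
    return targets, held_back


# Hypothetical inventory rows with operator-supplied metadata.
assets = [
    {"name": "maint-ws-01", "services": ["remote_desktop", "backup_agent"]},
    {"name": "hmi-04", "services": ["backup_agent"]},
    {"name": "ops-console-02", "services": ["remote_desktop"],
     "remote_operations": True},
]

targets, held_back = rdp_disable_targets(assets)
print("disable now:", targets)
print("needs review:", held_back)
```

Anything in the held-back list would go through the slower, locally supervised path rather than the central button press.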
I do remember a conversation I had with Greg Hale, and his opinion was that, just as you said, that IT and OT are talking to each other at a much deeper level than they were 10 years ago. What was there 10 years ago? There was the first IEC standard out, there was a couple of documents from the DHS, there really wasn’t much. Nowadays, there’s countless standards, there’s countless experience, but in his opinion, a lot of people were still out there kicking the IT/OT tires and not doing OT security enough. Do you agree with that? Where is the state of the world, and if it’s anything close to where Greg thinks it is, how do we fix that? What needs to change going forward?
Rick: Yeah, it’s a great question and it’s interesting because we go into an environment in 1 of 2 ways always. Either we’re brought in by IT and we need to show them we understand OT and can work with OT and get OT to trust us, or OT has us come in and we have to prove to IT that their systems and methods won’t necessarily work. Like we talked about, you can’t standardize and automate necessarily just on IT standard systems. And so, yes, much more does need to happen. Now, to Greg’s point, there is much more discussion, there are a lot more IT people coming to OT conferences and vice versa. I see more and more organizations that are tying those 2 together, whether it’s formally in a RACI chart governance discussion, or whether it’s informally in terms of reporting structure within the organization. But a lot more does have to happen, there is still a huge distrust between the IT and the OT side of the house and we still are brought in often many, many times as a so-called expert to help debunk whichever other side we’re arguing against. And it’s unfortunate, but it’s true, and a lot of those perspectives are very, very well entrenched. I always exhort when I’m talking to an OT guy to trust the IT guys, and when I’m talking to an IT guy, to give the OT guys a little more slack. But there’s built-up experiences of IT automating something that breaks something in OT and therefore, we’re all up in arms because we’re all on-call on the weekend for some reason. So, we’re getting there, but that’s why I like the method of having empirical data that we can sit down and have an actual discussion about as opposed to conjecture or opinion like we talked about a bit earlier.
Nate: Sounds from what Rick is saying, like both IT and OT people tend to bring in Verve to get the facts.
Andrew: That’s right. In my experience, it’s always more fun to argue when there’s facts in front of you instead of an empty piece of paper.
Nate: Andrew, I have to disagree with you. Most of my most fun arguments have to do with people who have no sense of the real facts.
Andrew: Yes. Well, maybe I should have used the word ‘productive’; it’s more productive to argue. We’re coming up on the end here, I wanted to leave Rick with the last word.
So, this has been great, Rick. We like to leave our guests with the last word. Is there a parting thought you’d like to leave with our listeners?
Rick: Yes, and thank you very much for the opportunity; appreciate this. This is a topic that’s got lots of moving parts, and as we have alluded to, there’s many people with different opinions. And so, I’m a firm believer that the more we communicate, the better off we’ll be. And so, the only thing I’d want to push to people is we regularly put up these things we find in the field, lessons learned, 5 steps to a faster deployment or tuning or information or insight, we often post threat updates too once in a while if need be. So, really what I’d love for the community to do is to reach out, go to our blog page, find a topic you like, make some comments, use the “contact us” to suggest other topics you might want to see, but let’s just keep the dialogue going. And so, the more we can share from what we’ve seen multiple times in the field, if I can help you, the end user, avoid making the same mistake because I’ve seen it 20 times and we’re sharing how not to do it, then I’ve done my job. But, yeah, please come see us, and thank you, Andrew, and Waterfall for giving us the opportunity to do this.
Andrew: I had a look at the Verve blog and it’s actually, it can be difficult to find. In some browsers, you go to the verveindustrial.com website, you look under the resources tab, you click on blog and it’s all good. In other browsers, for some reason, that tab does not render. So, if you’re looking for the blog in a browser that you’re struggling to see, “Where is this blog on the verveindustrial.com site (not verve.com, verveindustrial.com)?” go to info.verveindustrial.com and you’ll find the blog.
Nate: Okay. Then with that, I think we’re all finished up here. Thanks to Rick Kaun for sitting down with you, and thanks to you, Andrew, for sitting down with me.
Andrew: Thank you, Nate, we’ll catch you next time.