Tony Turner

Efficacy of Controls



An exploration of Critical Function Assurance and the hazards pyramid applied to cybersecurity and controls efficacy in critical infrastructure



Efficacy of Controls in Critical Infrastructure (Part 1)

The topic of control efficacy is rooted in a need for better prioritization. We don’t really know which control is going to save us from suffering a high-consequence event, so we implement security-hygiene-based approaches, or maybe just what the compliance frameworks tell us we must do. But it’s very much a “spray and pray” approach to risk management, without much common sense or risk-informed logic. This lack of awareness starts with a failure to understand what needs protecting, and it is exacerbated by a lack of understanding of which threat vectors matter and, therefore, which controls will provide the best return on our security investment.

In this first installment of a multi-part series (it’s hard to say how long it will run, but at least two parts), we will explore a few key concepts, including:

  • Critical Function Assurance
  • The Controls Efficacy Pyramid
  • Public Health Walkthrough

In Part 2 we will go into more detail on a new way of leveraging MITRE ATT&CK, the ICS Kill Chain and other similar models, along with a “layered” approach to controls modeling. So, let’s go ahead and start with Critical Function Assurance (CFA).

Critical Function Assurance

Many people have heard of Cyber Informed Engineering (CIE) by now as a way to achieve “Secure by Design” in critical infrastructure. More specifically, CIE embodies the marriage of cybersecurity and engineering concepts; it is not a cybersecurity framework per se, but more of a way to embed security into the engineering process. CFA is part of this ecosystem, and in many ways it aligns closely with concepts like Business Continuity Planning, or at least the early phases of a Business Impact Analysis, where the core mission of the organization is identified along with the dependencies required to execute that function or functions.

CFA can be thought of as the “why” - the driver behind all the controls we are considering adopting. It asks how the critical function is delivered: what are the inputs and outputs? It helps focus the risk management conversation on just those elements that are important to your operation. By itself it does not reduce risk, but it provides a powerful targeting mechanism and feeds other downstream risk management activities.

(Image retrieved from https://inl.gov/content/uploads/2023/12/23-50859_R1-1.pdf)

CFA is quickly finding its place in the CIE family of activities, as seen in the INL paper above. But at its core, it really answers two questions: What are my Critical Functions? And what are the impacts from a failure of my Critical Functions?

Once we have identified our Critical Functions, we can begin to decompose them into their dependencies. This includes people, process, technology, infrastructure and data. The reality is that a failure in any of these components can have a catastrophic effect on our critical functions. Yet from a risk management standpoint, these pockets of risk can be highly siloed. As cybersecurity practitioners, we are commonly focused only on technology and data, and possibly a subset of infrastructure such as data networks.
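To make the decomposition concrete, here is a minimal sketch of a critical function and its dependency categories (illustrative only; the field names follow the list above, and the example values are assumptions, not output from any CFA tool):

```python
from dataclasses import dataclass, field

@dataclass
class CriticalFunction:
    """A critical function and the dependencies required to deliver it."""
    name: str
    people: list[str] = field(default_factory=list)          # operators, SMEs, approvers
    processes: list[str] = field(default_factory=list)       # procedures, workflows
    technology: list[str] = field(default_factory=list)      # applications, control systems
    infrastructure: list[str] = field(default_factory=list)  # facilities, networks, power
    data: list[str] = field(default_factory=list)            # datasets, records, telemetry

# Hypothetical example of a decomposed critical function
treatment = CriticalFunction(
    name="Deliver safe drinking water",
    people=["plant operators"],
    processes=["chlorination procedure"],
    technology=["PLC-controlled dosing pumps"],
    infrastructure=["treatment facility", "SCADA network", "grid power"],
    data=["water quality telemetry"],
)
```

Laying the dependencies out this way makes the siloing obvious: a cybersecurity program that only looks at the technology and data lists is ignoring most of what the critical function depends on.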

The Controls Efficacy Pyramid

Now that we can see what must be protected, we need a common frame of reference to start designing and documenting our controls. Often this means looking to our vendors, who sell amazing solutions to cybersecurity problems, and without proper understanding this may mean taking their promises of security nirvana at face value. After all, they are trusted partners, right?

But let’s look for a moment at one approach taken by the safety community to understand the efficacy of controls as it relates to hazard mitigation. The image below is based on a graphic from the CDC website, one of many variations on this concept, and we will attempt to apply the same approach to cybersecurity. It has been shown to be effective in managing risk for process safety and other traditional risk silos like physical security as well. It’s a great way to think about preventative controls and consequence reduction, but it certainly does not address detective controls as effectively. For instance, network security monitoring and incident response are probably captured better in models such as the NIST CSF, but if we are simply focusing on our ability to withstand negative events, this can be a great way to look at things.

This model consists of 5 levels, or in some variations 6, which I describe below. The most effective control types of all are those where we can eliminate the hazard, but this is not always possible for obvious reasons. Before we go deeper into this conversation, it’s important to cover the concept of a hazard, as it’s not something we think about a lot in cybersecurity. It’s certainly a core concept in emergency management, OT and process safety, though. Perhaps you have heard of an “All Hazards Plan” or possibly a HAZOPS (Hazard and Operability Study).

An “All Hazards Plan” is frequently captured in a Continuity of Operations (COOP) plan document or a Business Continuity Plan (BCP). These are not cybersecurity focused, and in fact it is very common that cyber is not captured at all. They tend to focus on events like natural disasters, human error, active shooters, pandemics and other acts of God. We have seen cybersecurity scenarios start to be added to these documents, ransomware being a popular choice. But there’s a reason why organizations have focused here: these are the sources of risk that have traditionally caused the greatest concern. These are the hazards we are talking about. In the pyramid above, a mechanical failure or a human error might be the hazard that leads to injury or death, or to product quality or production issues. Those are the consequences of the hazard.

When we talk about ransomware as a hazard, or broader cybersecurity concerns, it’s helpful to understand how that cybersecurity issue can ultimately lead to a consequence, and why it should be considered a hazard at all, because it’s these hazards that we need to mitigate. To bring it back to cybersecurity terms, hazards are closely related to threats and consequences, but they are not the same thing.

  • Threats are the source of harm. Perhaps it is a cyber adversary. But threats may not be human, entropy in a mechanical part could be a threat too. Mother Nature. Squirrels.
  • Vulnerabilities are the weaknesses that the threat abuses. The mean time before failure of that mechanical component might be too short to withstand extended use. Or perhaps it is a CVE identified in a critical system software component.
  • Hazards are what the Threat creates by exploiting the vulnerability. That mechanical component failing creates a hazard for the human operator who is now presented with an unsafe system. Or that cyber vulnerability may present a Loss of View or a Loss of Control for system operations. We frequently think of that as the Consequence, but Loss of Control on its own is not sufficient to produce a consequence until it impacts our Critical Function.
  • Consequences are what ultimately occur. Perhaps injury or death. Perhaps it is some type of damage to the system or a failure in the critical function. Remember, this is what we are trying to protect here, not the system itself.

So, to bring it back to the pyramid, we are really mitigating hazards here. That can happen upstream by targeting threats or vulnerabilities in the system, but ultimately this is hazard mitigation.
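One way to keep these four terms straight is to model the chain explicitly. The sketch below is only an illustration; the class name and the ransomware example values are assumptions chosen to mirror the definitions above, not terminology from any standard.

```python
from dataclasses import dataclass

@dataclass
class HazardChain:
    """A threat exploits a vulnerability, creating a hazard that may impact a critical function."""
    threat: str          # source of harm (adversary, wear, weather, squirrels)
    vulnerability: str   # weakness the threat abuses
    hazard: str          # unsafe condition created by the exploitation
    consequence: str     # what ultimately occurs if the hazard reaches the critical function

    def describe(self) -> str:
        return (f"{self.threat} exploits {self.vulnerability}, creating {self.hazard}; "
                f"if unmitigated, the result is {self.consequence}.")

# Example mirroring the ransomware discussion above
chain = HazardChain(
    threat="ransomware operator",
    vulnerability="unpatched CVE in a critical system component",
    hazard="loss of view / loss of control for system operations",
    consequence="failure of the critical function the system supports",
)
print(chain.describe())
```

The useful part of writing it down this way is seeing where each pyramid layer acts: elimination and substitution work on the hazard (or the vulnerability that creates it), while weaker layers mostly try to keep the hazard from reaching a consequence.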

Elimination

The most effective means to mitigate the risk of a hazard, is for the hazard to not exist at all. If it’s not there, it can’t hurt us. In terms of an engineered system, this could be a chemical harmful to human life that can be removed. Or perhaps a cyber asset that is not necessary and can be removed. In terms of Cyber Informed Engineering, the concept of Design Simplification speaks directly to this.

It asks the question “How do I determine what features of my system are not absolutely necessary to achieve the critical functions?” In this way we can completely engineer out the risk by simply removing the hazard. There may be other hazards, but at least this one cannot be realized as a consequence.

Substitution

If we cannot remove the hazard, perhaps we can replace it with something that is less risky. Perhaps a harmful chemical can be replaced with one that is safer for human use. Or a dangerous and rusty bar can be replaced with an anodized aluminum surface that is smooth to the touch. This is also where patching a vulnerability substitutes a vulnerable application with an application that is not vulnerable (at least to that specific CVE). Other actions that change the state of the system, such as configuration management, can also be considered substitutions. After all, by modifying the innate nature of a system, we are essentially replacing something that is hazardous with something that is less hazardous.

Engineered Controls

This one may get a bit confusing, especially as we foray into Cyber Informed Engineering. Engineered controls also fit into the category of system redesign. The CDC describes controls such as adding better ventilation for workers to reduce the hazard of noxious fumes as an example of an engineered control. In terms of CIE, we are typically referring to mechanical alternatives to digital controls when we use this terminology. For instance, the e-stop and the manual bypass are examples of engineered controls.

From a cybersecurity perspective, engineered controls can be thought of as solutions that make it impractical for a hazard to harm us by changing the conditions in some way. A data diode only transmits network packets one way; it has been engineered with physics to function in a way that makes bidirectional traffic flows originating from a low-trust environment practically impossible. Likewise, controls such as backups and high availability create a condition that makes recovery possible regardless of what an attacker does, though this does rest on certain assumptions. For instance, if the backups are not properly managed, they might also become encrypted by ransomware.

Administrative Controls

This category is focused on procedural change. For instance, if a worker is at risk of harm from machinery when standing within 5 feet of the cutting blade, adding safety precautions and warning tape at 7 feet provides a procedural mechanism to protect the worker. If they never cross that barrier, it stands to reason the hazard cannot harm them. This of course requires that they follow the procedure. Reinforcing that barrier with physical bars or other engineered controls could make things even safer.

One example I have seen be very effective in cybersecurity and even fraud situations is the use of dual authorization or dual control, also referred to as two-person approval. Perhaps two accountants must agree to a payment transaction before it can be sent. Or, as we saw in the aftermath of the SolarWinds attack, two or more developers must agree before software code can be committed. This requires collusion instead of leaving a single point of failure. It’s still not impossible to defeat, and it is a weaker form of control, but it can still be useful for risk mitigation. If you can change people’s behavior, you can reduce the risk of the hazard.
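As a rough sketch of how dual authorization might be enforced in software (purely illustrative; the function name, threshold and example values are assumptions, not a reference to any particular product):

```python
def authorize(action: str, approvers: set[str], required: int = 2) -> bool:
    """Allow an action only when enough *distinct* people have approved it."""
    if len(approvers) < required:
        raise PermissionError(
            f"{action!r} needs {required} distinct approvers, got {len(approvers)}"
        )
    return True

# A single approver (or the same person approving twice, since a set deduplicates)
# is rejected, so abusing this action now requires collusion.
authorize("release payment batch", {"alice", "bob"})      # allowed
# authorize("release payment batch", {"alice"})           # raises PermissionError
```

The control is administrative because it changes behavior and process; the code above simply makes the procedure harder to skip.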

Personal Protective Equipment (PPE)

The least effective controls are those that rely on a “last mile” defensive barrier. When this barrier fails, there is nothing standing in the way of harm. If you are handling dirty needles, and one punctures your latex glove, you can become infected. If your mask is not worn properly or fails, you can still contract COVID-19, and in fact even if it does not fail, it only reduces the chance of infection. If your hard hat malfunctions or falls off, and a concrete block falls on your head, you are likely to suffer injury. It does provide some protection, and we should always use PPE when appropriate, but it’s not a terribly strong safeguard.

Likewise, the security industry has promoted the use of many such products that are essentially digital PPE. We tell ourselves we will be safe because we have a firewall or antivirus software, and then get surprised when it fails to protect us. Our vendors make promises about the capabilities of their products but fail to inform us of the need for additional layered defenses in combination with their product. That does not mean we should not use firewalls and endpoint security tools. But it does mean we should take a commonsense approach to cybersecurity that focuses on the need for layered defenses and assumes failure modes. We will talk about this a bit more in the next section.

People – the 6th layer

In some variants of this model, it has been suggested that the tip of the pyramid is really people. Training our people is an important activity to reduce the chance of a hazard materializing. We see this all the time in cybersecurity with phishing training or secure coding awareness for developers. We have excellent training options for ICS security, such as SANS courses and IACS credentials from ISA. But this is a very weak approach if we rely only on training, because invariably somebody will click the link. Visit the evil website. Or maybe they will forget the safety training and make a mistake. Human error is one of the major hazards we track in our work, and frankly it is the source of more incidents in ICS than cybersecurity in our experience. But when we get to the question of what the controls apply to, just remember that this layer is really about securing the human. And since humans can impact the secure operation of your system, it's still an important consideration.
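Before we apply the model, a small recap sketch may help. It simply orders the six layers by the relative efficacy described above (the numeric values are an illustrative ranking, nothing more):

```python
from enum import IntEnum

class ControlEfficacy(IntEnum):
    """Higher value = more effective at mitigating the hazard."""
    PEOPLE = 1          # training and awareness; relies entirely on human behavior
    PPE = 2             # last-mile barriers, including "digital PPE" used on its own
    ADMINISTRATIVE = 3  # procedural change, e.g. dual authorization
    ENGINEERED = 4      # change the conditions, e.g. data diodes, backups
    SUBSTITUTION = 5    # replace the hazard with something less hazardous, e.g. patching
    ELIMINATION = 6     # remove the hazard entirely, e.g. design simplification

def best_feasible(options: list[ControlEfficacy]) -> ControlEfficacy:
    """Prefer the most effective control layer that is actually feasible."""
    return max(options)

best_feasible([ControlEfficacy.PPE, ControlEfficacy.ADMINISTRATIVE])  # ADMINISTRATIVE
```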

Public Health Scenario

I wanted to close this first article with a concrete example of how you could apply these techniques. I spent close to 10 years in state government across the public health and air transportation sectors, focused on emergency management, business continuity, disaster recovery and cybersecurity. In some cases the work was IT-centric, and in others OT was in play. But it really does not matter, as this is a mission-driven approach that puts your operational considerations at the core of the strategy.

There are several critical functions we could model here, but let’s start with one we have all had very recent experience with: prevent or retard pandemic outbreak. This may not feel like a cybersecurity-related issue, but bear with me as we walk through this critical function.

So, we have a critical function identified; the first step is understanding how this function is delivered. What are the requirements of my function and its dependencies? What is needed to deliver my function? We need to create a list of enabling functions, the activities that are typically delivered at a departmental level. Who is involved in the delivery of the function? What resources do they need to deliver this function? What happens when these enabling functions are working correctly? What happens when they are not? And ultimately, what are the consequences of failure for this critical function? In the Opswright platform, we use a Tree model to map these dependencies. Let’s use one version of a tree diagram below to explain.
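As a rough illustration of that kind of dependency tree, here is a minimal sketch (the node names are taken from the walkthrough that follows; the structure itself is illustrative and not the actual Opswright model):

```python
# A critical function decomposed into enabling functions and their dependencies.
# Plain nested dicts are enough to illustrate the tree; real tooling would add
# requirements, owners, and consequence ratings at each node.
function_tree = {
    "Prevent pandemic outbreak": {
        "Drive public policy (lobby government stakeholders)": {},
        "Communicate with the public": {},
        "Vaccinations": {},
        "Facility management": {},
        "Monitor infection rates": {
            "Develop and deliver the test": {},
            "Communication": {},
            "Epidemiological coordination with peers": {},
            "Process test results": {
                "Lab facilities": {},
                "Lab equipment and computers": {},
                "Staff to operate them": {},
            },
        },
    },
}

def walk(tree: dict, depth: int = 0) -> None:
    """Print the dependency tree with indentation."""
    for node, children in tree.items():
        print("  " * depth + node)
        walk(children, depth + 1)

walk(function_tree)
```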

What is needed to deliver my function?

In our example above, where we must prevent pandemic outbreak, there are several enabling functions we quickly identify. These include lobbying government stakeholders to drive public policy, communicating with the public, vaccinations, facility management and more. Here we are focusing on only a single enabling function, monitor infection rates; you would need to go through a similar exercise for each of the others.

To monitor infection rates, I have other objectives that must be met, including developing and delivering the test, communication, epidemiological coordination with peers and, as seen above, processing test results. Through interviews with subject matter experts, we determine that accuracy and availability of tests are our primary requirements. Without good data we cannot effectively monitor infection rates, which will degrade our ability to prevent outbreak.

What are the inputs?

We have several dependencies for processing test results: physical lab facilities, lab equipment and computers, and staff to operate them. A failure in any of these could compromise our ability to accurately produce test results.

What are the hazards?

  • Mechanical failure
  • Critical Cyber Vulnerability
  • Human Error
  • IT Maintenance Failure
  • Power failure

Using the pyramid of efficacy, we can ask ourselves what the most effective measure to use for each hazard is. Can we eliminate it? Substitute it? Redesign it? We will use a simple table below to explore this.
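Here is a rough sketch of that kind of hazard-to-control mapping (the pairings are drawn from the hazard list above and the notes that follow; the specific control choices are illustrative, not prescriptive):

```python
# Illustrative hazard-to-control mapping for the "process test results" dependency.
# Each hazard is paired with the most effective feasible layer of the pyramid
# plus supporting measures from weaker layers.
hazard_controls = {
    "Mechanical failure":           ("Substitution",   "replace the component before failure, or redesign it out entirely"),
    "Critical cyber vulnerability": ("Substitution",   "patch or harden; isolate or sandbox the component"),
    "Human error":                  ("Elimination",    "automate the task; add approval and confirmation controls"),
    "IT maintenance failure":       ("Administrative", "off-hours maintenance windows; restrict admin and network access"),
    "Power failure":                ("Engineered",     "redundant circuits, battery backup, tested failover"),
}

for hazard, (layer, mitigation) in hazard_controls.items():
    print(f"{hazard}: [{layer}] {mitigation}")
```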

Using this type of approach, we can explore the most important hazards to mitigate and how we can most effectively protect our critical function. The table is a bit abbreviated, so brief descriptions of a few of these are below.

  • We had two hazards related to human error in our example above, so if we can automate tasks and remove the human, we can eliminate the hazards related to the human operator. We do limit ourselves to the logic programmed into the automation, and adding additional approval and confirmation controls can help.
  • We had a mechanical component that could simply be replaced before it failed. Performing proper maintenance is very important. Or perhaps the component is not necessary, and a redesign of the system could eliminate it from the design.
  • The cyber vulnerability could be mitigated through patching, or perhaps system hardening. This is certainly within the realm of cyber hygiene, and these approaches are helpful but not enough. Additionally, sandboxing or isolating software components makes them more resistant to attack. You may not have the ability to directly influence the application if you are not the supplier, though.
  • Changing IT maintenance procedures, such as only performing them during off hours, or even removing admin privileges, may be called for in certain critical operations. Does IT need elevated access to this system? Should a compartmentalized technology management function be created for sensitive lab assets? This is certainly something to consider. Restricting network access from IT networks would also be a good idea.
  • The power failure speaks to interdependencies that are important to understand and plan for. Redundant circuits, battery backups, redundant power supplies, and definitely testing failover with actual interrupt testing would help ensure reliability.

You will notice that very little of this involves controls from a framework checklist. None of this is to say that you don’t still tackle cyber-hygiene, but if you want to protect your critical functions, this is an effective approach you can take.

This is exactly what we help critical infrastructure organizations do every day. When your mission is a matter of life or death, please get in touch. Opswright – Secure Engineering, By Design.


Tony Turner

Founder, CEO

Experienced cybersecurity executive with 30+ years in the field; author of SANS SEC547 (Defending Product Supply Chains) and Software Transparency.
