The Practicality - Usability Exploit Gap

Sam Ginzburg

2019/02/01

Categories: Side Channels, Covert Channels, Computer Security. Tags: Computer Security

Introduction

Over the past few years, there has been tremendous fervor over the existence of various side and covert channel vulnerabilities. Some, such as the Meltdown [1] and Spectre [2] variants, are extremely usable, while others, such as [3], are only practical for now. In this post I hope to answer what the difference between a practical attack and a usable attack is, why it matters at all, and how we can tell when attacks start to move from practical to usable. This post should not be viewed as a mathematical proof that the two categories of vulnerabilities are distinct, but rather as a general approximation of the differences between them, with some of my own reasoning thrown in to help explain what I think the security community has been observing over the past few years. You could say this is basically just a heuristic for evaluating threats. This post also provides an argument for determining when it makes sense to explore radical new systems designs to help secure systems from attacks, and when it doesn't.1

What Does it Mean for an Exploit to be Practical?

In the context of academic computer security, the practicality of an exploit is often demonstrated with a PoC (proof of concept). The PoC essentially validates a threat model, and is viewed as a call to action in regards to solving a security-related issue. However, the existence of a PoC is more analogous to a software beta than a finished product. Issues that concern malware developers don't really apply to practical exploits, since the main argument behind a PoC is that the strategy used is viable, not that a real-life objective can be accomplished.

What Does it Mean for an Exploit to be Usable?

Usable exploits are what you could consider “production grade” exploits. The defining characteristic of usable exploits is a combination of three factors.

  1. Widespread Applicability
  2. High Probability of Success
  3. Ease of Deployment
Look Familiar?

Widespread applicability refers to the ability of an exploit to work correctly on a significant subset of computer systems. Analyzing whether a vulnerability meets this criterion can be complicated, though. Threat models aside, even different attackers exploiting the same threat model can have different goals, which changes how you can define "applicable". A great example of this would be cryptolocker attacks, which seek to maximize the installation base, versus targeted attacks like Stuxnet, which go after specific computer systems. While the threat model for Stuxnet is slightly different, as it targeted an air-gapped system, the applicability argument is identical to the cryptolocker attacks - even for the nation-state-funded malware that literally tried to blow up uranium enrichment systems. The image above, obtained from a postmortem on Stuxnet done by Symantec [4], shows a list of vulnerabilities used in the creation of the Stuxnet virus. However, without context, this list of vulnerabilities could have been targeting any system on the planet! Every single exploit shown here is generalizable to millions of computer systems across the planet. When examining other APTs (advanced persistent threats), similar results can be found. While theoretically a hyper-targeted vulnerability could have been used, what we actually observe is that the vulnerabilities used are near-universal in nature. My own personal opinion is that this is a direct result of Occam's Razor: if an attacker can take a generalizable vulnerability and reuse it, they will always choose to do so instead of expending extra energy on a custom solution, since the custom solution is far more likely to not work as expected. As a result, the applicability of an exploit is a huge factor in whether or not it gets used in the real world by an attacker.

High probability of success refers to how likely an exploit is to work on the first try. One thing that software engineers hate working around in computer systems is non-determinism (although I doubt I have to remind the reader of this). Even in today's world, zero-click RCE (remote code execution) vulnerabilities don't exactly grow on trees. Opportunities to infect victims are limited, and in the case of targeting specific individuals, you don't want to have to get your target to click the same email 15 times just so the expected number of infections reaches 1. There is also a second factor to the probability of success, which depends on the capabilities of the targets. While adversaries have gotten better, so have defenses. An attacker has a general idea of what actions can get them caught, but not a perfect view of the world, so there is some inherent risk of getting caught by an endpoint security product or a system administrator. The ability of an exploit to evade these defenses is critical to how useful it actually is.
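The "expected value of 1" intuition above can be made concrete with a little probability. If each delivery attempt succeeds independently with probability p, the number of attempts until the first success is geometric with mean 1/p, and the attempt count needed for a given confidence level grows logarithmically. A minimal sketch (the 1-in-15 success rate is just the illustrative number from the paragraph above):

```python
import math

def expected_attempts(p):
    # Mean of a geometric distribution: average attempts until first success.
    return 1 / p

def attempts_for_confidence(p, confidence=0.95):
    # Smallest n such that 1 - (1 - p)**n >= confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))
```

With a 1-in-15 per-attempt success rate, the attacker expects 15 attempts per victim and needs 44 attempts to be 95% confident of a single infection - a huge operational footprint (and detection surface) compared to a deterministic exploit.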

Ease of deployment can be confused with the two previous conditions, but it has a specific meaning here. Ease of deployment refers to how difficult it is to use the exploit to accomplish a given goal, in combination with how hard it is for an adversary to defend from the attack.

Why Does the Difference Matter?

At first, as a knee-jerk reaction, it can be easy to state that the difference doesn't matter. If a PoC exists, that is the boundary at which point a patch should go out. I agree with this statement 100%, as we live in an increasingly paranoid world. But there is also a second, less talked about boundary, which is the point at which it begins to make sense to radically depart from previous designs in order to truly fix the problem. This is when an exploit crosses over from being practical to being usable. As it turns out, there are many cases where we can patch practical exploits without requiring a radical redesign of our software or hardware systems.

Understanding the difference between these two points in time is critical: before attacks become fully usable, it is still viable to patch vulnerabilities on a case-by-case basis. However, as attackers learn how to package PoCs into more convenient forms, it becomes unsustainable to patch bugs in this manner. We can see evidence of this today in the tweet above: we failed to migrate away from C-style unsafe languages in time, and as a result there are so many vulnerabilities in modern computer systems that it is no longer viable to manually patch all buffer overflows in C code in order to ensure security. The argument in this case is no longer academic or theoretical, but objective reality, as there is obvious real-world evidence in the form of widespread malware that seems to never end.

The goal here is to learn from our past mistakes, and figure out how to spot this crossover point in the future, so we can avoid repeating our current scenario with newer classes of bugs.

The Dreaded Side and Covert Channel Attack

In particular, I want to focus on side and covert channel attacks in this post, as I feel history is repeating itself here. In traditional computer science literature, one of the earliest references to side and covert channel attacks that I could find was Butler Lampson's work on the confinement problem [5].2

“Covert channels, i.e. those not intended for information transfer at all, such as the service program’s effect on the system load.” - Butler W. Lampson

As Lampson's definition suggests, covert channel attacks attempt to exfiltrate information - in the context of asymmetric cryptography, almost always some sort of private key - through some sort of shared resource. Without context, this is terrifying, since we traditionally define security through privilege barriers. We rely on a combination of virtual memory, CPU extensions (Intel VT-x, SGX), CPU protection rings (the separation of user and kernel mode), and logical checks in our program code. However, since side and covert channel attacks bypass all of those defenses by definition, mitigating them requires new solutions, which for the most part result in large performance losses when applied to existing computer systems. In practice, this means the problem goes ignored by the vast majority of developers, since no one wants to be that guy who trashes the performance of their project and has to explain it to their managers.

Even within the class of side and covert channel attacks, there is a wide array of threat models targeting a variety of shared resources. Some are practical, and others usable.

Which Side and Covert Channel Attacks are Usable?

Side and covert channel attack threat models can be illustrated via three key steps.

  1. Obtaining co-location
  2. Performing observations
  3. Offline analysis

Each unique side and covert channel threat model can be analyzed with this framework, with usability determined by looking at each step of the attack. In this section I hope to show a few examples of commonly discussed side and covert channel attacks, and explain why they are usable or only practical.

Meltdown and Spectre These two side channel attacks exploited shared speculation resources in order to read memory loaded into the CPU cache speculatively. Lumping these two together may seem strange, since they exploit very different architectural mechanisms, but at a high level they are used in similar ways. The threat model here is that these attacks could be executed in a drive-by fashion by tricking a user into navigating to a website.3 The observation phase of these attacks was extremely short, as the attack could exfiltrate arbitrary data at a high rate (a little over 500 KB/s in the original Meltdown paper). What was great (or terrible, I guess?) was that there was no need for any offline analysis, as the attacks didn't have to reconstruct any sort of private key to be useful. Since these attacks affected just about every laptop and server on the market at the time, were deterministic in nature, and were easy to implement and deploy, a mass panic briefly ensued. Conclusion: Usable
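To make the observation phase concrete, here is a toy simulation of the final decoding step of a Flush+Reload-style probe - not the transient execution itself, and all latencies and the noise model are invented for illustration. The idea: after speculation touches one line of a 256-entry probe array (indexed by the leaked byte), that line is cache-hot, so the fastest reload reveals the byte.

```python
import random

CACHE_HIT_NS, CACHE_MISS_NS = 40, 300  # illustrative latencies, not measured

def probe_timings(secret_byte, noise_ns=20, seed=1):
    # Simulate reload times for the 256 probe-array lines after transient
    # execution touched only the line indexed by the secret byte.
    rng = random.Random(seed)
    return [(CACHE_HIT_NS if i == secret_byte else CACHE_MISS_NS)
            + rng.uniform(0, noise_ns)
            for i in range(256)]

def recover_byte(timings):
    # The fastest line is the one the speculative access pulled into cache.
    return min(range(256), key=timings.__getitem__)
```

In the real attack this decode runs once per leaked byte with no offline analysis afterwards, which is exactly why the channel sustains such high exfiltration rates.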

Attacks on the L3 (Last-Level) CPU Cache These particular attacks are some of the most frequently mentioned attacks in CS literature, and the papers featuring this work target the threat model of the multi-tenant cloud. In these attacks co-location is much harder to get, since it requires the cloud operator to place you next to your desired victim. There has been a fair amount of prior work on obtaining co-location in the multi-tenant cloud [6], [7], but obtaining co-location depends both on the cloud scheduler and on verifying that you actually ended up where you think you ended up. It turns out that most covert channel attacks that require this type of co-location aren't generalizable or deterministic. Co-location strategies are extremely victim dependent, and furthermore require that the victim be using a public cloud at all (for obvious reasons). Not to mention that, at least in the case of Microsoft Azure [8], randomized scheduling is employed. Randomized scheduling makes the attack non-deterministic, so you now need to spin up many more VMs to eventually co-locate with your target. The exact number is a function of the excess capacity of the cloud provider, which is a semi-secret number. I wasn't able to find exact documentation of how Microsoft accomplishes this, but I was able to find a paper by Microsoft Research employees detailing a formulation that does the same thing [9]. If their implementation roughly matches this formulation, and if we assume that Microsoft's cloud has grown significantly since 2014 (the date of that publication),4 then it is safe to say this phase of the attack will be very non-deterministic and take a significant amount of time. This is all ignoring the observation phase, which is further complicated by noise and the limited amount of information that can be extracted. Unlike microarchitectural attacks like Spectre and Meltdown, the data exfiltrated via this channel is application dependent. Don't get me wrong - this can be serious in the case of libraries like OpenSSL, but on a strict severity scale, it is obviously less severe and simpler to patch. One last detail is that since these attacks target cloud providers instead of individuals, patches can be deployed universally within hours. This means the window in which the vulnerability remains usable post-disclosure is significantly limited - which might seem like an irrelevant consideration, but exploiting already-disclosed vulnerabilities happens in practice shockingly often.5 Conclusion: Practical
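The cost that randomized scheduling imposes on the co-location phase can be sketched with a toy placement model. Both numbers below (`total_hosts` and `victim_hosts`) are invented for illustration, since the real excess-capacity figure is, as noted, semi-secret; the model simply assumes each attacker VM lands on a uniformly random host.

```python
import random

def expected_vms_to_colocate(total_hosts, victim_hosts):
    # Each launched VM lands on a victim-hosting machine with probability
    # p = victim_hosts / total_hosts, so the number of launches until the
    # first co-location is geometric with mean 1/p.
    return total_hosts / victim_hosts

def simulate_colocation(total_hosts, victim_hosts, trials=1000, seed=0):
    # Empirical mean number of VM launches until first co-location,
    # treating hosts 0..victim_hosts-1 as the ones running the target.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 1
        while rng.randrange(total_hosts) >= victim_hosts:
            n += 1
        total += n
    return total / trials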

Software covert channel attacks Software covert channel attacks are covert channels introduced through shared software resources rather than shared hardware resources. This essentially means that co-location just means being able to run user-mode applications on the target machine. The key advantage of these types of attacks is that they aren't affected by hardware heterogeneity, which means they can affect more systems than attacks that utilize shared hardware resources. These attacks also tend to assume co-location has already been achieved - which greatly simplifies the attack. One example of such an attack was the shared software page cache attack on Linux and Windows [10]. The major downside to this attack was that it required the ability to call mincore on Linux, or the analogous call on Windows. You may be asking yourself, "What does mincore even do? I've never heard of that syscall". Well, the paper answers this with a not-so-surprising result - pretty much nobody actually uses it. A valid defense is simply to disable that syscall using some form of containerization. In addition, the spatial resolution of this specific channel was not high enough to attack many existing crypto implementations. This isn't to say the attack was useless - the authors were able to demonstrate legitimate PoCs for several examples. However, the examples were extremely targeted and definitely not generalizable to a larger group of applications. While in this specific instance it was easy to mitigate the attack, it is very possible we will not be as lucky with future software covert channel attacks, potentially ones with even greater spatial resolution. Conclusion: Practical - for now
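The probe primitive at the heart of that attack is easy to demonstrate. The sketch below is Linux-only (it returns None elsewhere) and reaches the raw mincore syscall through ctypes; it reports which pages of a file are resident in the OS page cache, which is exactly the signal the attack uses to infer a victim's file accesses.

```python
import ctypes
import mmap
import os
import sys

def page_residency(path):
    # Return one boolean per page of `path`: is it resident in the
    # page cache? Linux-only; returns None on other platforms.
    if not sys.platform.startswith("linux"):
        return None
    size = os.path.getsize(path)
    if size == 0:
        return []
    libc = ctypes.CDLL(None, use_errno=True)
    with open(path, "rb") as f:
        # ACCESS_COPY gives a writable private mapping, which ctypes
        # can expose as a buffer; untouched pages still reflect the
        # page cache, which is what mincore inspects.
        mm = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY)
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    buf = ctypes.c_char.from_buffer(mm)  # page-aligned base address
    ret = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                       ctypes.c_size_t(size), vec)
    if ret != 0:
        raise OSError(ctypes.get_errno(), "mincore failed")
    resident = [bool(b & 1) for b in vec]
    del buf  # release the exported buffer before closing the mapping
    mm.close()
    return resident
```

Note that this requires no privileges at all, which is why simply filtering the syscall (as the mitigation discussion above suggests) is such an effective defense.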

Bridging the Gap

So we now have some loose, approximate definitions of what a practical vulnerability is and what a usable vulnerability is. My goal, however, is not to argue semantics, but to use these definitions to help understand when it is justifiable to argue for a dramatic shift in systems design. Dropping support for legacy applications is taboo in the enterprise software world - so we need to be 100% sure of our justifications before arguing for dropping legacy support.

Based on my definitions above, it is clear that microarchitectural attacks like Spectre and Meltdown already justify (and have justified) dramatic shifts in systems design, both hardware and software. What is less clear is at what point we can justify a similar mass panic over the other side and covert channel attacks that have come out over the past decade or so. For now these attacks remain practical - and may not make sense for adversaries to deploy en masse yet. My own personal theory (likely shared by many) is that the main thing preventing these attacks from becoming usable is not the qualities of the attacks themselves, but the qualities of other, more usable vulnerabilities. What's important to remember is that the definition of a "usable" exploit that I've come up with is actually relative. Usability is determined in reference to what other vulnerabilities exist out in the wild. Therefore, in a strange fashion, as other vulnerabilities get patched, less usable vulnerabilities actually become more usable, simply on the basis that they still work.6 In an ironic sense, the appearance of fully verified operating systems such as seL4 [11] in production is the most likely crossover point in time. That said, the appearance of these systems isn't going to be a clean barrier at which these attacks suddenly matter. We aren't going to go to sleep one night and wake up the next morning with a panic attack.

Conclusion

Just to be clear, I don't think we are anywhere near that specific crossover point yet (even things like Rust are still essentially in their infancy - let alone rewriting everything in Coq). I do think we should start thinking about more extensible long-term solutions, so that when the performance hits actually do become justifiable, we will be ready and able to just turn a "knob" instead of pulling a few months of all-nighters to produce a half-baked solution.

TL;DR

We should patch when vulnerabilities become practical, but panic when they become usable.

[1] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, and others, “Meltdown: Reading kernel memory from user space,” in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 973–990.

[2] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks: Exploiting speculative execution,” arXiv preprint arXiv:1801.01203, 2018.

[3] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are practical,” in 2015 IEEE Symposium on Security and Privacy (SP), 2015, pp. 605–622.

[4] N. Falliere, L. O. Murchu, and E. Chien, “W32. Stuxnet dossier,” White paper, Symantec Corp., Security Response, vol. 5, no. 6, p. 29, 2011.

[5] B. W. Lampson, “A note on the confinement problem,” Communications of the ACM, vol. 16, no. 10, pp. 613–615, 1973.

[6] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds,” in Proceedings of the 16th ACM Conference on Computer and Communications Security, 2009, pp. 199–212.

[7] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-tenant side-channel attacks in PaaS clouds,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014, pp. 990–1003.

[8] “Isolation in the Azure Public Cloud.” https://docs.microsoft.com/en-us/azure/security/azure-isolation, 2017.

[9] Y. Azar, S. Kamara, I. Menache, M. Raykova, and B. Shepard, “Co-location-resistant clouds,” in Proceedings of the 6th Edition of the ACM Workshop on Cloud Computing Security, 2014, pp. 9–20.

[10] D. Gruss, E. Kraft, T. Tiwari, M. Schwarz, A. Trachtenberg, J. Hennessey, A. Ionescu, and A. Fogh, “Page cache attacks,” arXiv preprint arXiv:1901.01161, 2019.

[11] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, and others, “seL4: Formal verification of an OS kernel,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009, pp. 207–220.


  1. But really this is just a long motivation and introduction section for future papers of mine.

  2. It’s possible there are earlier references I didn’t find. There are almost certainly much earlier sources for EM side channel attacks.

  3. For historical context, very soon after the Spectre/Meltdown announcement came out, there were already examples running in JavaScript posted on GitHub. Patches for major browsers came out quickly, and such an attack wouldn’t work today.

  4. Announcer voice: they have

  5. The British NHS had this exact situation happen to them.

  6. Something that affects N > 0 machines is always more generalizable than an exploit that affects 0 machines, no matter how small N is.