Six security risks in High Performance Computing (HPC)

This is a two-part blog. Today we’re going to talk about the risks many companies face in securing High Performance Computing (HPC) environments, and tomorrow we’ll talk about ways to properly secure such HPC systems.

Large-scale commodity cluster systems are finding increasing deployment in academic, research, and commercial settings. Coupled with this increasing popularity are concerns regarding the security of these clusters.

While a great deal of effort has gone into tools that aid in the installation, administration, and monitoring of clusters, very little has gone into tools that address the unique issues of cluster security, particularly for very large cluster installations.

Many people have believed that the issues related to cluster security are the same as those for general computer security. (“What works for one system should work for a collection of 100 systems.”) However, as cluster systems have become more widespread and powerful, they have become increasingly desirable targets for attackers.

Researchers at the National Center for Supercomputing Applications (NCSA), Department of Computer Science, University of Illinois at Urbana-Champaign found six ways in which cluster security is different from traditional enterprise-level security. In order to be effective, cluster protection schemes must take these into account.

  1. A cluster encompasses a collection of distributed resources to be protected. By definition, clusters are multiple, closely coupled machines that are centrally administered. These machines share common resources such as network access, compute cycles, and storage. The challenge is to secure these internal distributed resources against unauthorized access while still permitting easy access by legitimate users. In contrast, the machines in a typical enterprise environment are often very loosely coupled and share these kinds of resources only minimally.
  2. A cluster must provide mechanisms for resource management. The challenge here is to manage a cluster so that legitimate users can consume resources efficiently and in an authorized way, using an agreed-upon job prioritization system. Enterprise environments, by contrast, usually do not need to manage resources between competing interests. When a user executes a job on a cluster, it is often difficult to distinguish legitimate from illegitimate use unless there are obvious malicious process signatures. For example, legitimate cluster users may be able to tamper with shared data or to consume so many compute cycles that they disrupt the service available to other cluster users.
  3. Clusters present a heterogeneous management environment. That is, a cluster may be composed of different hardware and software node configurations (heterogeneous clusters). Even in the case of clusters containing the same hardware and software node configurations, there is usually a separation of cluster nodes by specialized function into “head nodes,” “compute nodes,” “storage nodes,” and “management nodes.” The challenge is to coordinate security across different node platforms and different specialized function nodes. This is different from enterprise-type security in that cluster security management must be simultaneously platform independent and specialized for different-functioning node types.
  4. Clusters have large-scale management requirements. As Schneier points out, security is a process, not a product. As clusters continue to grow, the task of maintaining and monitoring cluster security becomes intractable without help. For example, one production cluster at NCSA consists of 1,500 nodes. At this scale, it is not practical to manage a cluster without combining automation with human interaction. Because of the heterogeneous management environment described above, tools that automate security management need to be aware of the similarities (and differences) among cluster resources. In this way, cluster security differs from enterprise-level security, because the tools that target enterprise-level security typically assume that every resource is subtly different.
  5. Clusters, considered as a coherent unit, exhibit characteristic behavior different from non-clustered machines. This is exhibited in network traffic patterns, number of bytes transferred, applications executed, and compute loads. The challenge is first to identify, and later to understand, these behaviors via profiling in order to provide appropriate protections.
  6. Finally, and perhaps most relevant to the idea that cluster security is an evolving concept, cluster resources exhibit dependent risk. In enterprise-level security, a single compromise on a machine may result in unauthorized access, destruction of data, and a platform for future attacks. However, a compromised machine in an enterprise can be quarantined to prevent cascading damage. In contrast, the security of the resources in a cluster environment depends on the integrity of all nodes. A single compromised node in a cluster represents a dramatically increased risk to the rest of the cluster nodes because many nodes share identical configurations. In this way, clusters are much more vulnerable to “class break” types of attacks. Experience also suggests that security failures in clusters are worse than enterprise-level failures because cluster users tend to coordinate access across various geographically distributed resources. This coordination necessitates crossing security domains, and when one of these security domains is compromised, the attacker has a much easier job of compromising the others.
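
To make points 3, 4, and 6 a little more concrete, here is a minimal Python sketch of what "being aware of the similarities and differences among cluster resources" might look like in an automated check. It is only an illustration, not a tool described by the NCSA researchers: the node names, roles, and configuration keys are hypothetical, and a real cluster would pull this data from its own configuration management system. The idea is to group nodes by function, treat the most common configuration within each group as the baseline, and flag any node that drifts from it.

```python
import hashlib
from collections import Counter, defaultdict


def fingerprint(config):
    """Return a stable hash of a node's configuration (keys sorted first)."""
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_drift(nodes):
    """Map each role to the nodes whose config differs from that role's
    most common config. `nodes` maps node name -> (role, config dict)."""
    by_role = defaultdict(dict)
    for name, (role, config) in nodes.items():
        by_role[role][name] = fingerprint(config)

    drift = {}
    for role, prints in by_role.items():
        # The most common fingerprint within a role is assumed to be "normal".
        baseline, _count = Counter(prints.values()).most_common(1)[0]
        outliers = sorted(n for n, fp in prints.items() if fp != baseline)
        if outliers:
            drift[role] = outliers
    return drift


# Hypothetical inventory: names, roles, and settings are made up.
NODES = {
    "head01": ("head", {"sshd_root_login": "no", "kernel": "5.14"}),
    "compute01": ("compute", {"sshd_root_login": "no", "kernel": "5.14"}),
    "compute02": ("compute", {"sshd_root_login": "no", "kernel": "5.14"}),
    "compute03": ("compute", {"sshd_root_login": "yes", "kernel": "5.14"}),
}

print(find_drift(NODES))  # {'compute': ['compute03']}
```

Comparing nodes only within their own role is what keeps a head node from being flagged just because it legitimately differs from a compute node. Note that a majority baseline only catches a node that stands out from its peers. It says nothing when every node in a class shares the same weakness, which is exactly the "class break" risk in point 6.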

Tomorrow: How to successfully secure HPC systems.