How to successfully secure an HPC system

In a previous post, we talked about the inherent differences and security risks between a High Performance Computing (HPC) system (computer clusters) and a traditional enterprise network.

To secure a cluster, it must be treated as a single unit rather than as a collection of independent networked machines. HPC therefore requires a different approach from traditional enterprise-level security.

Unique Challenges of HPC Systems

HPC systems present unique challenges to security administrators because of the following characteristics:

High bandwidth connections

To facilitate its computational goals, a cluster must have high bandwidth connections to the outside world, allowing interactive use by many users, transfer of large datasets into and out of the cluster, and fast inter-node communication. These high bandwidth connections are attractive to attackers because the attacker can subsequently leverage them for purposes such as launching denial-of-service flood attacks against other sites.

Extensive Computational Power

Legitimate cluster users marshal the aggregate processing power of multiple machines with the goal of solving grand challenge scientific problems. In contrast, this computational power could be used by an attacker for purposes such as carrying out brute-force attacks against authentication mechanisms on other network resources to which the attacker wishes to gain unauthorized access.

For example, there have been cases where attackers have used parallel versions of traditional password cracking tools running on a compromised cluster in an attempt to decrypt stolen password files. Decrypting an encrypted password typically involves either a dictionary-type attack or a brute-force search through the entire space of possible passwords. Because both of these are “embarrassingly parallel” problems, a cluster gives near-linear speedup for the task, making its computational power an attractive target for hackers.

Massive Storage Capacity

Many high-performance cluster environments include storage capacity measured in terabytes, used for storing large scientific datasets and the results produced by computations involving these datasets. To a hacker, large-capacity disk storage is an attractive target for use in creating repositories of stolen copyrighted software and multimedia files.

The result of not treating cluster security as different from non-cluster security is an increased vulnerability to attacks that simultaneously target multiple cluster components.

Researchers at the National Center for Supercomputing Applications (NCSA), Department of Computer Science, University of Illinois at Urbana-Champaign put forth the following features of a security approach that would be most successful in securing an HPC system.

Process monitoring

Examining the individual processes running on each cluster node is critical for overall cluster security. Important here are tools, such as those based on the Clumon monitoring framework, that collect information about every process on every node in a cluster, analyze the set of processes found, and visually alert the cluster administrator when anomalous conditions are discovered.

Such anomalies might include system-related processes that should be running on a node but are in fact missing; processes that are running on a node when the node should be idle (particularly in the case of “root” processes); and an unusually large number of processes running on an individual node or across the entire cluster.

Detecting these types of situations within a cluster is possible because a cluster presents a relatively limited search space for anomaly detection, compared with an enterprise network of different machine types (servers, workstations, laptops) running an unbounded number of different software processes.
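
The three kinds of anomaly described above can be sketched in a few lines of Python. This is only an illustration, not the Clumon implementation: the baseline process names, the per-node threshold, and the function name find_process_anomalies are all assumptions, and the process lists are presumed to have been collected already by some monitoring agent.

```python
# Hypothetical baseline; the names and threshold are illustrative assumptions.
REQUIRED_PROCESSES = {"sshd", "crond"}
MAX_PROCESSES_PER_NODE = 200


def find_process_anomalies(node_processes, idle_nodes,
                           required=REQUIRED_PROCESSES,
                           max_per_node=MAX_PROCESSES_PER_NODE):
    """Return a list of (node, kind, detail) tuples.

    node_processes maps a node name to a list of (user, process_name) tuples.
    idle_nodes is the set of nodes that currently have no job assigned.
    """
    anomalies = []
    for node, procs in sorted(node_processes.items()):
        names = {name for _, name in procs}

        # 1. System processes that should be running but are missing.
        for missing in sorted(required - names):
            anomalies.append((node, "missing", missing))

        # 2. Non-baseline processes on a node that should be idle,
        #    with root-owned processes flagged separately.
        if node in idle_nodes:
            for user, name in procs:
                if name not in required:
                    kind = "idle_root" if user == "root" else "idle_user"
                    anomalies.append((node, kind, name))

        # 3. An unusually large number of processes on one node.
        if len(procs) > max_per_node:
            anomalies.append((node, "too_many", str(len(procs))))
    return anomalies


if __name__ == "__main__":
    sample = {
        "node01": [("root", "sshd"), ("root", "crond")],
        "node02": [("root", "sshd"), ("root", "unknown_daemon")],
    }
    for anomaly in find_process_anomalies(sample, idle_nodes={"node02"}):
        print(anomaly)
```

Because every compute node in a cluster is expected to look much like its neighbors, a single small baseline of this kind can plausibly cover the whole machine, which is the "limited search space" advantage noted above.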

Network Port Scanning

Unexpected open network ports on a cluster node can be a good indicator of suspicious activity. Cluster administrators should have a port scanner that is tailored to the cluster environment, monitors port usage, and presents the results visually. The underlying idea is that network ports must be opened in order for an attacker to interact with a cluster; otherwise, compromising a cluster is of limited value, since there can be little or no interaction with the compromised nodes.
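
As a rough sketch of the idea, the Python below probes TCP ports with the standard socket module and reports any open port that is not on an allowlist. The allowlist contents and the function names are assumptions made for illustration; a production scanner would also handle UDP, rate limiting, and the visualization mentioned above.

```python
import socket

# Hypothetical allowlist of ports expected to be open on a compute node.
EXPECTED_PORTS = {22}


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan_node(host, ports, timeout=0.5):
    """Return the subset of the given ports that accept TCP connections."""
    return {port for port in ports if is_port_open(host, port, timeout)}


def unexpected_ports(open_ports_by_node, expected=EXPECTED_PORTS):
    """Map each node to its sorted list of open ports not on the allowlist."""
    report = {}
    for node, ports in open_ports_by_node.items():
        extra = set(ports) - set(expected)
        if extra:
            report[node] = sorted(extra)
    return report


if __name__ == "__main__":
    # Scan results would normally come from scan_node(); fixed data is used
    # here so the example runs without touching the network.
    observed = {"node01": {22}, "node02": {22, 6667}}
    print(unexpected_ports(observed))
```

Keeping the comparison step separate from the probing step means the same allowlist check can be applied to results gathered another way, for example from a node-local agent.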

Traffic analysis

Applications running on cluster systems have unique patterns of communication, making the task of distinguishing legitimate traffic from abnormal traffic difficult. This difficulty is compounded by the growing use of grid computing software, which exhibits communication patterns that cross cluster boundaries by joining multiple geographically distributed clusters into a single computational resource.

Information from the cluster job scheduler should be correlated with network traffic into and out of the cluster in order to distinguish typical cluster traffic patterns from suspicious or known malicious ones. For example, an automated traffic analysis tool can use contextual information from the job manager, as well as a constrained set of legitimate IP addresses belonging to one or more well-known clusters, to aid in recognizing patterns of communication in parallel computations such as localized neighbor communication, many-to-many communication, or all-to-all communication.

That is, if a set of nodes is communicating with each other within the context of a single job, the traffic is most likely legitimate. This is in contrast to a machine on an enterprise network, which is not attached to any unifying context. The ultimate goal is to automatically detect these types of traffic patterns.
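
A minimal Python sketch of this correlation follows. It assumes the job scheduler can be queried for the nodes assigned to each job and that flows are available as (source, destination) address pairs; the function name classify_flows, the labels, and the example addresses (taken from private and documentation ranges) are all hypothetical.

```python
import ipaddress


def classify_flows(flows, jobs, trusted_networks):
    """Label each (src, dst) flow using job-scheduler context.

    flows: iterable of (src_ip, dst_ip) strings.
    jobs: dict mapping a job id to the set of node IPs assigned to it.
    trusted_networks: CIDR strings for well-known partner clusters.
    """
    nets = [ipaddress.ip_network(n) for n in trusted_networks]

    # Invert the scheduler data: which jobs is each node part of?
    node_to_jobs = {}
    for job_id, nodes in jobs.items():
        for node in nodes:
            node_to_jobs.setdefault(node, set()).add(job_id)

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    results = []
    for src, dst in flows:
        shared = node_to_jobs.get(src, set()) & node_to_jobs.get(dst, set())
        if shared:
            # Both endpoints belong to the same job: most likely legitimate.
            label = "intra-job"
        elif (src in node_to_jobs and is_trusted(dst)) or \
                (dst in node_to_jobs and is_trusted(src)):
            # A job node talking to a well-known partner cluster.
            label = "known-cluster"
        else:
            label = "suspicious"
        results.append((src, dst, label))
    return results


if __name__ == "__main__":
    jobs = {"job1": {"10.0.0.1", "10.0.0.2"}, "job2": {"10.0.0.3"}}
    flows = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "192.0.2.10"),
             ("10.0.0.3", "203.0.113.5")]
    for row in classify_flows(flows, jobs, ["192.0.2.0/24"]):
        print(row)
```

Flows labeled suspicious here are only candidates for closer inspection; a real tool would also weigh the expected communication pattern of each job.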


In conclusion, to be effective, cluster security tools must monitor the state of the entire cluster, consider all facets of the cluster security problem, and base decisions within this context. And because of the performance requirements in high-performance distributed systems, it is not possible to simply retrofit existing security mechanisms and expect the HPC community to use them.