Grace Hopper

Status

These compute nodes are physically installed and in the process of being configured and tested (as of March 5). It may take a few weeks before we release them for general use.

Hardware Specifications

There are two GH200 Grace Hopper nodes. The specifications for each node are:

Processor Family: NVIDIA GH200 Grace Hopper Superchip
Number of Processors: 1
Processor Type: 72-core NVIDIA Grace ARM Neoverse V2
GPU: 1 x NVIDIA H100
Internal Interconnect: NVIDIA NVLink-C2C 900 GB/s
System Memory: 480 GB LPDDR5X
GPU Memory: 96 GB HBM3
Memory accessible from GPU: All 576 GB of GPU and system memory is accessible from the GPU
InfiniBand: HDR200 (operating at 100 Gb/s due to upstream switches)

CPU Architecture: It's very different

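These nodes use ARM, a different CPU architecture than the x86/x86_64 used by AMD and Intel. This means that compiled code must be rebuilt for ARM; binaries built for x86_64 will not run here. Interpreted scripts (shell, Python, and similar) should generally still work, provided that any compiled modules they load are also available for ARM.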

This particular ARM architecture is usually labeled as aarch64 in Linux.
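
A script that may run on both the x86_64 and the ARM nodes can branch on the architecture string reported by `uname -m`. The following is a minimal sketch; the function name `arch_label` and the labels "arm" and "x86" are illustrative names chosen for this example, not anything defined by the cluster.

```shell
#!/bin/bash
# Map the output of `uname -m` to a short label.
# The function name and the labels are hypothetical, chosen for this sketch.
arch_label() {
    case "$1" in
        aarch64) echo "arm" ;;
        x86_64)  echo "x86" ;;
        *)       echo "unknown" ;;
    esac
}

# Label the node this script is currently running on.
label="$(arch_label "$(uname -m)")"
echo "This node reports $(uname -m) ($label)"
```

A label like this could be used, for example, to choose between separate per-architecture build directories so that ARM and x86_64 binaries do not get mixed.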

Batch jobs

To use these nodes, your job submissions must include:

Either -C arm or --constraint=arm
Either --os rhel9 or (preferably) submit the job from a RHEL 9 login node
A GPU request using an argument like --gpus gh200:1

Walltime limit: 1 day, subject to change.
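
The requirements above could be combined into a batch script along the following lines. This is a sketch under stated assumptions rather than a tested recipe: it assumes the job is submitted from a RHEL 9 login node (so no OS option is given), the one-hour time request is an arbitrary example within the 1-day limit, and `nvidia-smi` is only called if it happens to be on the node's PATH.

```shell
#!/bin/bash
# Request an ARM node.
#SBATCH --constraint=arm
# Request one GH200 GPU.
#SBATCH --gpus=gh200:1
# Example time request of one hour; must stay within the 1-day walltime limit.
#SBATCH --time=01:00:00
# Assumption: submitted from a RHEL 9 login node. Otherwise add the
# --os rhel9 option described in the list above.

# Confirm that the job landed on an aarch64 node.
arch="$(uname -m)"
echo "Running on $(uname -n) with architecture $arch"

# Show the allocated GPU if nvidia-smi is available on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```

If the script were saved under a name such as `gh200_test.sh` (a hypothetical filename), it would be submitted with `sbatch gh200_test.sh`.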

Operating System

The OS will be our Red Hat Enterprise Linux 9 image.

Support

These nodes will have minimal software support. Maintaining our existing software image and applications takes a lot of work, and we do not anticipate putting that same level of effort into ARM systems like these unless we make a larger investment in ARM products. That is not to say we won't support them, just that you can expect much less staff time and effort on these specialty nodes. Ask us if you have questions, but be aware that we may not be able to dedicate the effort to help with particular applications.

Many users interested in these nodes are likely using libraries like PyTorch, which should just work or need only minor changes. Libraries like PyTorch and other very common applications are where we will focus our support effort.
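
One quick way to see whether a PyTorch installation "just works" on one of these nodes is to check that it imports and can see the GPU. The sketch below assumes a `python3` on the PATH with an ARM build of PyTorch already installed in the active environment; how that environment is set up is not covered here, and the variable name `torch_status` is illustrative.

```shell
#!/bin/bash
# Check whether PyTorch imports and whether it can see a CUDA device.
# Assumes python3 is on the PATH; falls back to a message if the import fails.
if python3 -c 'import torch' >/dev/null 2>&1; then
    # Prints the PyTorch version followed by True or False for GPU visibility.
    torch_status="$(python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())')"
else
    torch_status="not importable"
fi
echo "PyTorch: $torch_status"
```

Run inside a job on a GH200 node, a result ending in `False` or "not importable" would suggest that the installed build does not match the node's architecture or CUDA setup.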

Code Compilation

In the future, we will add information here about recommendations for which compilers and flags to use.

GH200 vs H100 vs H200

NVIDIA's naming system is confusing. "Grace" (the "G" in "GH200") refers to the CPU generation, and "Hopper" (the "H" in "GH200") refers to the GPU generation. The number at the end appears to be specific to the combination of CPU and GPU. In this case, the GH200 contains an H100 GPU along with a 72-core Grace CPU, which is an NVIDIA ARM CPU. The GH200 does not contain an H200 GPU, which is a different product. Bigger numbers do not always mean better products: the H800, for example, is a cut-down version of the H100 with reduced interconnect bandwidth.

Last changed on Wed Apr 10 09:48:17 2024