NVIDIA Announces H100 NVL - Maximum Memory Server Board for Large Language Models

While this year's Spring GTC event doesn't feature any new GPUs or GPU architectures from NVIDIA, the company is still in the process of rolling out new products based on the Hopper and Ada Lovelace GPUs it launched last year. At the high end of the market, the company today is announcing a new variant of the H100 accelerator aimed specifically at users of large language models: the H100 NVL.

The H100 NVL is an interesting variant of NVIDIA's H100 PCIe card that, in keeping with the times and NVIDIA's broad success in the field of AI, addresses a singular market: large language model (LLM) deployment. There are a few things that make this card atypical of NVIDIA's usual server fare, not least of which is that it consists of two PCIe H100 cards that come already bridged together, but the big draw is the large memory capacity. The combined dual-GPU card offers 188GB of HBM3 memory, 94GB per card, providing more memory per GPU than any other NVIDIA part to date, even within the H100 family.

Comparison of NVIDIA H100 accelerator specifications

                        H100 NVL                 H100 PCIe          H100 SXM
CUDA FP32 cores         2 x 16896?               14592              16896
Tensor cores            2 x 528?                 456                528
Boost clock             1.98GHz?                 1.75GHz            1.98GHz
Memory clock            ~5.1Gbps HBM3            3.2Gbps HBM2e      5.23Gbps HBM3
Memory bus width        6144-bit                 5120-bit           5120-bit
Memory bandwidth        2 x 3.9TB/sec            2TB/sec            3.35TB/sec
VRAM                    2 x 94GB (188GB)         80GB               80GB
FP32 vector             2 x 67 TFLOPS?           51 TFLOPS          67 TFLOPS
FP64 vector             2 x 34 TFLOPS?           26 TFLOPS          34 TFLOPS
INT8 tensor             2 x 1980 TOPS            1513 TOPS          1980 TOPS
FP16 tensor             2 x 990 TFLOPS           756 TFLOPS         990 TFLOPS
TF32 tensor             2 x 495 TFLOPS           378 TFLOPS         495 TFLOPS
FP64 tensor             2 x 67 TFLOPS?           51 TFLOPS          67 TFLOPS
Interconnect            NVLink 4,                NVLink 4           NVLink 4,
                        18 links (900GB/sec)                        18 links (900GB/sec)
GPU                     2 x GH100                GH100              GH100
Transistor count        2 x 80B                  80B                80B
TDP                     700W                     350W               700-800W
Manufacturing process   TSMC 4N                  TSMC 4N            TSMC 4N
Interface               2 x PCIe 5.0             PCIe 5.0
                        (quad slot)              (dual slot)
Architecture            Hopper                   Hopper             Hopper

Driving this SKU is a specific niche: memory capacity. Large language models such as the GPT family are in many respects memory capacity bound, as they will quickly fill up even an H100 accelerator just to hold all of their parameters (175B in the case of the largest GPT-3 models). As a result, NVIDIA has decided to put together a new H100 SKU that offers a bit more memory per GPU than the usual H100 parts, which top out at 80GB per GPU.
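To illustrate why capacity is the binding constraint, here is a back-of-envelope sketch of how many accelerators it takes just to hold GPT-3's 175B parameters. This assumes 2-byte FP16 weights and ignores activations, KV caches, and framework overhead, so real deployments need more headroom:

```python
import math

def min_gpus_for_weights(params: float, bytes_per_param: int, gb_per_gpu: int) -> int:
    """Minimum GPU count needed just to hold the model weights."""
    weight_gb = params * bytes_per_param / 1e9  # decimal GB, matching marketing capacities
    return math.ceil(weight_gb / gb_per_gpu)

GPT3_PARAMS = 175e9  # largest GPT-3 model

# FP16 weights alone: 175B params x 2 bytes = 350GB
print(min_gpus_for_weights(GPT3_PARAMS, 2, 80))  # regular 80GB H100: 5 GPUs
print(min_gpus_for_weights(GPT3_PARAMS, 2, 94))  # 94GB H100 NVL GPU: 4 GPUs
```

At FP16 the weights alone come to 350GB, so the extra 14GB per GPU shaves a whole accelerator off the minimum configuration.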

Under the hood, what we're looking at is essentially a special bin of the GH100 GPU placed on a PCIe card. All GH100 GPUs come with six stacks of HBM memory, either HBM2e or HBM3, with a capacity of 16GB per stack. However, for yield reasons, NVIDIA only ships regular H100 parts with five of the six HBM stacks enabled. So while there is nominally 96GB of VRAM on each GPU, only 80GB is available on regular SKUs.

The H100 NVL, in turn, is the mythical fully enabled SKU, with all six stacks active. By turning on the sixth HBM stack, NVIDIA gains access to the additional memory and memory bandwidth it provides. This will have some material impact on yields (just how much is a closely guarded NVIDIA secret), but the LLM market is apparently large enough, and willing to pay a high enough premium for nearly perfect GH100 packages, to make it worth NVIDIA's while.

Even then, it should be noted that customers don't get access to quite all of the 96GB per card. Rather, with a total capacity of 188GB of memory, they effectively get 94GB per card. NVIDIA didn't go into detail on this design quirk in our pre-briefing ahead of today's keynote, but we suspect this is also for yield reasons, giving NVIDIA some leeway to disable faulty cells (or layers) within the HBM3 memory stacks. The net result is that the new SKU offers 14GB more memory per GH100 GPU, a memory increase of 17.5%. Meanwhile, aggregate memory bandwidth for the card is 7.8TB/second, which works out to 3.9TB/second for the individual cards.
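The quoted figures are internally consistent, as a quick sanity check shows. The ~5.1Gbps pin speed is NVIDIA's approximate number, so the bandwidth result is likewise approximate:

```python
# Per-GPU bandwidth: 6144-bit bus at ~5.1Gbps per pin
bus_bits = 6144
pin_gbps = 5.1
bw_tb = bus_bits * pin_gbps / 8 / 1000
print(round(bw_tb, 2))  # ~3.92, i.e. the quoted 3.9TB/sec per GPU (7.8TB/sec for the pair)

# Capacity: 6 stacks x 16GB = 96GB on-package, of which 94GB is exposed
extra_gb = 94 - 80
print(extra_gb, round(extra_gb / 80 * 100, 1))  # 14GB more per GPU, a 17.5% increase
```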

Aside from the increased memory capacity, in many ways the individual cards within the larger dual-GPU/dual-card H100 NVL look a lot like the SXM5 version of the H100 placed on a PCIe card. While the regular H100 PCIe is held back by slower HBM2e memory, fewer active SMs/tensor cores, and lower clock speeds, the tensor core performance figures NVIDIA is quoting for the H100 NVL are all on par with the H100 SXM5, indicating that this card isn't further cut down like the regular PCIe card. We're still waiting on the final, complete specifications for the product, but assuming everything here is as presented, the GH100s going into the H100 NVL would represent the highest-binned GH100s currently available.

And an emphasis on the plural is required here. As noted earlier, the H100 NVL is not a single-GPU part, but rather a dual-GPU/dual-card part, and it presents itself to the host system as such. The hardware itself is based on two PCIe form factor H100s linked to each other using three NVLink 4 bridges. Physically, this is virtually identical to NVIDIA's existing PCIe H100 design, which can already be paired using NVLink bridges, so the difference isn't in the construction of the two-card/four-slot behemoth, but rather in the quality of the silicon inside. In other words, you can tie ordinary H100 PCIe cards together today, but it wouldn't match the memory bandwidth, memory capacity, or tensor throughput of the H100 NVL.

Surprisingly, despite the stellar specs, the TDPs stay close. The H100 NVL is a 700W to 800W part, which breaks down to 350W to 400W per card, the lower bound of which is the same TDP as the regular PCIe H100. In this case, NVIDIA seems to be prioritizing compatibility over peak performance, as few server chassis can handle PCIe cards over 350W (and fewer still over 400W), meaning TDPs have to stay put. Still, given the higher performance figures and memory bandwidth, it's unclear how NVIDIA delivers the extra performance. Power binning can go a long way here, but it may also be a case of NVIDIA giving the card a higher than usual boost clock, since the target market is mostly interested in tensor performance and won't be lighting up the entire GPU at once.

Otherwise, NVIDIA's decision to release what is essentially the best H100 bin is an unusual choice given their general preference for SXM parts, but it's a decision that makes sense in the context of what LLM customers need. Large SXM-based H100 clusters can easily scale up to 8 GPUs, but the amount of NVLink bandwidth available between any two of them is constrained by the need to pass through NVSwitches. For a single dual-GPU configuration, pairing a set of PCIe cards is much more direct, with the hard link providing 600GB/second of bandwidth between the cards.

But perhaps most important is simply the ability to quickly deploy the H100 NVL into existing infrastructure. Rather than requiring the installation of purpose-built HGX H100 carrier boards to pair up GPUs, LLM customers can simply drop the H100 NVL into new server builds, or use it as a relatively quick upgrade to existing server builds. NVIDIA is targeting a very specific market here, after all, so the usual SXM advantage (and NVIDIA's ability to throw its collective weight around) may not apply.

All told, NVIDIA is promoting the H100 NVL as offering 12 times the GPT3-175B inference throughput of the last-generation HGX A100 (8 H100 NVLs vs. 8 A100s). For customers looking to deploy and scale up their systems for LLM workloads as quickly as possible, that will certainly be appealing. As noted earlier, the H100 NVL brings nothing new in terms of architectural features (much of the performance boost here comes from the Hopper architecture's new transformer engines), but the H100 NVL will serve a specific niche as the fastest PCIe H100 option, and the option with the largest GPU memory pool.

Bottom line: according to NVIDIA, the H100 NVL cards will begin shipping in the second half of this year. The company isn't quoting a price, but for what is essentially a top GH100 bin, we'd expect them to command a top price, especially in light of how the explosion in LLM usage is turning into a new gold rush for the server GPU market.
