Nvidia’s generative AI inferencing card is just two H100s glued together

GTC Nvidia’s strategy for capitalizing on generative AI hype: glue two H100 PCIe cards together, of course.
At GTC this week, Nvidia unveiled a new version of its H100 GPU, dubbed the H100 NVL, which it says is ideal for inferencing large language models like ChatGPT or GPT-4. And if it looks like two H100 PCIe cards stuck together, that’s because that’s exactly what it is. (Well, it’s got faster memory too, but more on that later.)
“These GPUs work as one to deploy large language models and GPT models anywhere from 5 billion parameters to 200 [billion],” Nvidia VP of Accelerated Computing Ian Buck said during a press briefing on Monday.
The form factor is a bit of an odd one for Nvidia, which has a long history of packing multiple GPU dies onto a single card rather than bridging two of them. In fact, Nvidia’s Grace Hopper superchips basically do just that, albeit with a Grace CPU and a Hopper GH100. If we had to guess, Nvidia may have run into trouble packing enough power circuitry and memory onto a standard enterprise PCIe form factor.
Speaking of form factor, the frankencard is huge by any stretch of the imagination, spanning four slots and boasting a TDP to match at roughly 700W. Communication is handled by a pair of PCIe 5.0 x16 slots, because, again, this is just two H100s glued together. The glue in this equation appears to be three NVLink bridges, which Nvidia says are good for 600GB/s of bandwidth, or a little more than 4.5x what the two PCIe interfaces can manage.
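For a rough sense of where that multiplier comes from, here’s our own back-of-the-envelope comparison, assuming roughly 64GB/s per direction for a single PCIe 5.0 x16 link:

```python
# Rough sanity check of Nvidia's bandwidth comparison (all figures approximate).
nvlink_bw_gbs = 600            # GB/s across the three NVLink bridges, per Nvidia
pcie5_x16_gbs = 64             # GB/s, one direction, for a single PCIe 5.0 x16 link
dual_pcie_gbs = 2 * pcie5_x16_gbs

print(nvlink_bw_gbs / dual_pcie_gbs)   # ~4.7x, in line with "a little more than 4.5x"
```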
While you might expect performance on par with a pair of H100s, Nvidia claims the card is actually capable of 2.4x-2.6x the performance of a single H100, at least in FP8 and FP16 workloads.
That efficiency can seemingly be attributed to Nvidia’s determination to make use of sooner HBM3 reminiscence as a substitute of HBM2e. We’ll observe that Nvidia is already utilizing HBM3 on its bigger SMX5 GPUs. And the reminiscence would not simply have larger bandwidth — 4x in comparison with a single 80GB H100 PCIe card — there’s additionally extra of it: 94GB per die.
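To illustrate why that capacity matters for the model sizes Buck mentioned, here’s our own back-of-the-envelope sketch, assuming a GPT-3-class model of about 175 billion parameters quantized to one byte per weight; the numbers are ours, not Nvidia’s:

```python
# Back-of-the-envelope: weights of a GPT-3-class model vs the NVL's combined memory.
params = 175e9                 # ~175 billion parameters (our assumption of a GPT-3-class model)
bytes_per_param = 1            # FP8 quantized weights (assumption)
weights_gb = params * bytes_per_param / 1e9

nvl_memory_gb = 2 * 94         # two GH100 dies with 94GB of HBM3 each
print(weights_gb, nvl_memory_gb)   # ~175GB of weights against 188GB of memory, before KV cache and activations
```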
The cards themselves are aimed at large language model inferencing. “Training is the first step: teaching a neural network model to perform a task, answer a question, or generate an image. Inference is the deployment of those models in production,” Buck said.
While Nvidia already has its larger SXM5 H100s out in the wild for AI training, those are only available from OEMs in sets of four or eight. And at 700W apiece, the resulting systems aren’t just hot, they’re potentially challenging for existing datacenters to accommodate. For reference, most colocation racks come in at between 6-10kW.
By comparison, the H100 NVL, at 700W per card, should be a bit easier to accommodate. By our estimate, a single-socket, dual H100 NVL system (four GH100 dies) would land somewhere in the neighborhood of 2.5kW.
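For what it’s worth, that figure comes from a rough budget along these lines, with the CPU and platform numbers being our assumptions rather than anything Nvidia has published:

```python
# Our rough power budget for a single-socket, dual H100 NVL box (every figure is an assumption).
gpu_w = 2 * 700                # two H100 NVL cards at roughly 700W apiece
cpu_w = 350                    # one high-end server CPU
rest_w = 750                   # memory, NICs, storage, fans, PSU overhead
print(gpu_w + cpu_w + rest_w)  # ~2,500W, hence our "neighborhood of 2.5kW" figure
```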
However, anyone interested in picking one of these up is going to have to wait. While Nvidia may have taken the easy route and glued two cards together, the company says its NVL cards won’t be ready until sometime in the second half of the year.
What if you don’t need a fire-breathing GPU?
If you’re in the market for something a little more efficient, Nvidia also launched the successor to the venerable T4. The Ada Lovelace-based L4 is a low-profile, single-slot GPU with a TDP of just 72W, nearly a tenth that of the H100 NVL.

Nvidia’s L4 is a low-profile, single-slot card that sips just 72W
This means the card, like its predecessors, can be powered entirely off the PCIe bus. Unlike the NVL cards, which are designed for inferencing on large models, Nvidia is positioning the L4 as its “universal GPU.” In other words, it’s just another GPU, but smaller and cheaper so more of them can be crammed into a system: up to eight per server, to be exact. According to the L4 datasheet, each card comes equipped with 24GB of vRAM and up to 32 teraflops of FP32 performance.
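Scaled up to a fully loaded server, those per-card datasheet numbers work out roughly as follows:

```python
# Aggregate figures for a server packed with eight L4s, using the per-card datasheet numbers.
cards = 8
print(cards * 24)              # 192GB of vRAM in total
print(cards * 32)              # up to 256 teraflops of FP32
print(cards * 72)              # 576W of GPU power, less than a single H100 NVL
```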
“It’s for efficient AI, video, and graphics,” Buck said, adding that the card is specifically optimized for AI video workloads and features new encoder/decoder accelerators.
“An L4 server can decode 1,040 video streams coming in from different mobile users,” he said, leaving out exactly how many GPUs that server would need to do so, or at what resolution those streams arrive.
This functionality lines up with Nvidia’s existing 4-series cards, which have traditionally been used for video decoding, encoding, transcoding, and streaming.
But just like its larger siblings, the L40 and H100, the card can also be used for AI inferencing on a variety of smaller models. To that end, one of the L4’s first customers will be Google Cloud, which is using it in its Vertex AI platform and G2-series VMs.
The L4 is available in private preview on GCP and can be purchased through Nvidia’s broader partner network. ®