Nvidia Introduces New Blackwell GPU for Trillion-Parameter AI Models


Nvidia’s newest and fastest GPU, code-named Blackwell, is here and will underpin the company’s AI plans this year. The chip offers performance improvements over its predecessors, including the red-hot H100 and A100 GPUs. Customers are demanding more AI performance, and the GPUs are primed to succeed given the pent-up demand for higher-performing GPUs.

The GPU can train 1-trillion-parameter models, said Ian Buck, vice president of high-performance and hyperscale computing at Nvidia, in a press briefing.

Systems with up to 576 Blackwell GPUs can be paired up to train multi-trillion-parameter models.

The GPU has 208 billion transistors and was made using TSMC’s 4-nanometer process. That is about 2.5 times the 80 billion transistors of the predecessor H100 GPU, which is the first clue to significant performance improvements.

AI is a memory-intensive process, and data needs to be temporarily stored in RAM. The GPU has 192GB of HBM3e memory, the same memory type used in last year’s H200 GPU.

Nvidia is focusing on scaling the number of Blackwell GPUs to take on larger AI jobs. “This will expand AI data center scale beyond 100,000 GPUs,” Buck said.

The GPU provides “20 petaflops of AI performance on a single GPU,” Buck said.

Buck offered fuzzy performance numbers designed to impress, and real-world performance numbers were unavailable. However, it is likely that Nvidia used FP4 – a new data type introduced with Blackwell – to measure performance and reach the 20-petaflop figure.

The predecessor H100 provided about 4 petaflops of performance for the FP8 data type and about 2 petaflops for FP16.
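Those figures line up with the usual rule of thumb that halving a data type’s bit width roughly doubles peak throughput. A minimal back-of-the-envelope check in Python (the per-precision inputs are the vendor’s claimed peaks, not measured results):

```python
# Back-of-the-envelope check: peak throughput roughly doubles
# each time the data type's bit width is halved.
h100_fp16_pflops = 2   # H100 claimed peak, FP16 (petaflops)
h100_fp8_pflops = 4    # H100 claimed peak, FP8

# If Blackwell's 20 petaflops is an FP4 number, the implied FP8
# figure would be about half of it:
blackwell_fp4_pflops = 20
implied_fp8 = blackwell_fp4_pflops / 2
print(f"Implied Blackwell FP8 peak: ~{implied_fp8:.0f} petaflops")
print(f"Gen-over-gen FP8 gain vs H100: ~{implied_fp8 / h100_fp8_pflops:.1f}x")
# -> roughly 10 petaflops FP8, ~2.5x the H100's FP8 peak
```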

“It delivers 4 times the training performance of Hopper, 30 times the inference performance overall, and 25 times better energy efficiency,” Buck said.

Nvidia CEO Jensen Huang holds a Blackwell chip (left) and a Hopper GPU at GTC on March 18, 2024

The FP4 data type is for inferencing and will allow the fastest computation on smaller packages of data, delivering results back much sooner. The result? Faster AI performance, but less precision. FP64 and FP32 provide higher-precision computing but are not designed for AI.
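To illustrate the trade-off, here is a small hypothetical Python sketch that simulates squeezing values into fewer bits and measures the rounding error it introduces (a generic uniform quantizer for illustration only, not Nvidia’s actual FP4 format):

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize x to 2**bits levels over its own range,
    then map back to floats, mimicking the rounding error a
    low-precision data type introduces."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
weights = rng.normal(0, 1, 10_000).astype(np.float32)

for bits in (16, 8, 4):
    err = np.abs(weights - quantize(weights, bits)).mean()
    print(f"{bits}-bit quantization, mean abs error: {err:.5f}")
# Fewer bits -> smaller, faster math, but visibly larger rounding error.
```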

The GPU consists of two dies packaged together. They communicate via an interface called NV-HBI, which transfers data at 10 terabytes per second. Blackwell’s 192GB of HBM3e memory is supported by 8TB/s of memory bandwidth.

The Systems

Nvidia has also created systems with Blackwell GPUs and Grace CPUs. First, it created the GB200 superchip, which pairs two Blackwell GPUs with one Grace CPU. Second, the company created a full, liquid-cooled rack system called the GB200 NVL72, which has 36 GB200 superchips and 72 GPUs interconnected in a single rack.

The GB200 NVL72 system delivers 720 petaflops of training performance and 1.4 exaflops of inferencing performance, and it can support 27-trillion-parameter model sizes. The GPUs are interconnected via a new NVLink interconnect, which has a bandwidth of 1.8TB/s.
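Those rack-level numbers follow directly from the per-GPU claims quoted earlier; a quick sanity check (assuming the 20-petaflop figure is FP4 inference and training runs at the FP8 rate, half of that):

```python
# Sanity check of the GB200 NVL72 rack-level claims against the
# per-GPU figures quoted earlier (assumed: 20 PF is FP4 inference,
# FP8 training throughput is half that rate).
gpus_per_rack = 36 * 2            # 36 GB200 superchips x 2 GPUs each
fp4_per_gpu_pf = 20               # claimed FP4 petaflops per GPU
inference_ef = gpus_per_rack * fp4_per_gpu_pf / 1000
training_pf = gpus_per_rack * (fp4_per_gpu_pf / 2)
print(f"{gpus_per_rack} GPUs -> ~{inference_ef:.2f} exaflops FP4 inference")
print(f"{gpus_per_rack} GPUs -> ~{training_pf:.0f} petaflops FP8 training")
# -> 72 GPUs, ~1.44 exaflops inference and ~720 petaflops training,
#    matching the quoted 1.4 EF / 720 PF rack figures.
```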

Huang shows off new Blackwell hardware at GTC

The GB200 NVL72 will be coming this year to cloud providers including Google Cloud and Oracle Cloud. It will also be available via Microsoft Azure and AWS.

Nvidia is building an AI supercomputer with AWS called Project Ceiba, which can deliver 400 exaflops of AI performance.

“We’ve now upgraded it to be Grace-Blackwell, supporting … 20,000 GPUs, and it will now deliver over 400 exaflops of AI,” Buck said, adding that the system will be live later this year.

Nvidia also announced an AI supercomputer called the DGX SuperPOD, which has eight GB200 systems – or 576 GPUs – and can deliver 11.5 exaflops of FP4 AI performance. The GB200 systems can be connected via the NVLink interconnect, which can sustain high speeds over a short distance.

Additionally, the DGX SuperPOD can link up tens of thousands of GPUs with the Nvidia Quantum InfiniBand networking stack. This networking bandwidth is 1,800 gigabytes per second.

Nvidia also launched another system called the DGX B200, which incorporates Intel’s 5th-Gen Xeon chips, called Emerald Rapids. The system pairs eight B200 GPUs with two Emerald Rapids chips. It can also be designed into x86-based SuperPOD systems. The systems can provide up to 144 petaflops of AI performance and include 1.4TB of GPU memory and 64TB/s of memory bandwidth.

The DGX systems will be available later this year.

Predictive Maintenance

The Blackwell GPUs and DGX systems have predictive-maintenance features to stay in top shape, said Charlie Boyle, vice president of DGX systems at Nvidia, in an interview with HPCwire.

“We’re monitoring thousands of data points every second to see how the job can get optimally done,” Boyle said.

The predictive-maintenance features are similar to RAS (reliability, availability, and serviceability) features in servers. It is a combination of hardware and software RAS features in the systems and GPUs.

“There are specific new … features in the chip to help us predict things that are going on. This feature is constantly looking at the path of data coming off of all those GPUs,” Boyle said.

Nvidia is also implementing AI features for predictive maintenance.

“We have a predictive-maintenance AI that we run at the cluster level so we see which nodes are healthy and which nodes aren’t,” Boyle said.

If a job dies, the feature helps minimize restart time. “On a very large job that used to take minutes, potentially hours, we’re trying to get that down to seconds,” Boyle said.

Software Updates

Nvidia also announced AI Enterprise 5.0, the overarching software platform that harnesses the speed and performance of the Blackwell GPUs.

As Datanami previously reported, the NIM software includes new tools for developers, including a copilot to make the software easier to use. Nvidia is trying to direct developers to write applications in CUDA, the company’s proprietary development platform.

The software costs $4,500 per GPU per year, or $1 per GPU per hour.
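A quick bit of arithmetic shows where those two pricing models cross over (a sketch, assuming the hourly meter only runs while the GPU is in use):

```python
# Break-even point between the two quoted AI Enterprise prices:
# $4,500 per GPU per year vs. $1 per GPU per hour.
annual_license = 4_500      # USD per GPU per year
hourly_rate = 1             # USD per GPU per hour
hours_per_year = 365 * 24   # 8,760

breakeven_hours = annual_license / hourly_rate
utilization = breakeven_hours / hours_per_year
print(f"Hourly billing is cheaper below {breakeven_hours:,.0f} hours/year "
      f"(~{utilization:.0%} utilization).")
# -> 4,500 hours, about 51% utilization: heavily used GPUs favor
#    the annual license, lightly used ones favor hourly billing.
```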

A feature called NVIDIA NIM is a runtime that can automate the deployment of AI models. The goal is to make it faster and easier to run AI in organizations.

“Just let Nvidia do the work to produce these models for them in the most efficient, enterprise-grade manner so that they can do the rest of their work,” said Manuvir Das, vice president for enterprise computing at Nvidia, during the press briefing.

NIM is more like a copilot for developers, helping them with coding, finding solutions, and using other tools to deploy AI more easily. It is one of the many new microservices the company has added to the software package.
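Deployed NIM containers expose an OpenAI-compatible HTTP endpoint, so calling a locally running model can look roughly like the sketch below (the port, model name, and prompt are illustrative assumptions, not values from the announcement):

```python
import requests

# Hypothetical call to a NIM container running locally; NIM services
# expose an OpenAI-compatible REST API. The port and model name here
# are illustrative assumptions.
NIM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama3-8b-instruct",   # whichever model the NIM serves
    "messages": [
        {"role": "user", "content": "Summarize what NVLink is in one sentence."}
    ],
    "max_tokens": 100,
}

resp = requests.post(NIM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```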
