
The latest test of speed in artificial intelligence (AI) neural network training is only partially about the fastest chips from Nvidia, AMD, and Intel. Increasingly, speed is also about the connections made between chips, the computer networking approaches, which involve competing vendors and technologies.
Also: Tech prophet Mary Meeker just dropped a massive report on AI trends - here's your TL;DR
MLCommons, the industry consortium that benchmarks AI systems, on Wednesday announced the latest scores by Nvidia and others in its twice-yearly report, called MLPerf Training, which measures how long it takes to train a neural network such as a large language model (LLM) "to convergence," meaning, until the neural network can perform at a specified level of accuracy.
The latest results suggest just how large AI systems have grown. The scaling-up of chips and related components is making AI computers ever more dependent on the connections between the chips.
This round, called 5.0, is the twelfth installment of the training tests. In the six years since the first round of testing, 0.5, the number of GPUs used has soared from systems with 32 chips to, in 5.0, a system with 8,192 GPU chips.
Also: 4 ways businesses are using AI to solve problems and create real value
Because AI systems are scaling to thousands of chips, and, in the real world, tens of thousands, hundreds of thousands, and, eventually, millions of GPU chips, "the network, and the configuration of the network, and the algorithms used to map the problem onto the network, become more important," said David Kanter, head of MLPerf, during a media briefing.
Most of AI is a matter of simple mathematics, linear algebra, such as multiplying a vector by a matrix. The magic happens when those operations are run in parallel across multiple chips, each working with a different portion of the data.
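As a concrete illustration (the shapes and values here are made up, not benchmark workloads), the heart of the computation is just a matrix product:

```python
import numpy as np

# The core linear-algebra step: a batch of input vectors multiplied
# by a weight matrix. Shapes and values are illustrative only.
x = np.random.default_rng(0).normal(size=(4, 3))   # 4 input vectors
w = np.ones((3, 2))                                # a 3x2 weight matrix
y = x @ w                                          # 4 output vectors, one per input

# Parallelizing across chips amounts to splitting the rows of x into
# shards; each chip computes its own shard @ w independently.
print(y.shape)                                     # (4, 2)
```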
Also: 5 ways to convert magic saving the time of AI into its productivity superpower
"One of the simplest methods is something called data parallelism, where you have the same [AI] model on several nodes," Kanter said, referring to the parts of a multi-chip computer, called nodes, that can act independently of one another. "Then the data just comes in, and then you communicate those results" across all the parts of the computer, he said.
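Here is a minimal sketch of that idea, simulating data parallelism with NumPy rather than real networked nodes; the linear model, shapes, and learning rate are all illustrative assumptions:

```python
import numpy as np

# Data parallelism, simulated: each "node" holds a copy of the same
# model weights and computes gradients on its own shard of the batch.
# An all-reduce then averages the gradients so all nodes stay in sync.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 2))          # one shared model, copied to all nodes
batch = rng.normal(size=(8, 4))            # global batch of 8 examples
targets = rng.normal(size=(8, 2))

def local_gradient(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    return 2 * x.T @ (x @ w - y) / len(x)

# Split the batch across 4 simulated nodes; each computes a local gradient.
shards = np.split(batch, 4), np.split(targets, 4)
grads = [local_gradient(weights, x, y) for x, y in zip(*shards)]

# The "all-reduce": average the per-node gradients. Over a real network,
# this is the step where interconnect speed and topology dominate.
avg_grad = sum(grads) / len(grads)
weights -= 0.01 * avg_grad                 # every node applies the same update
```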
"Networking is quite integral to this," Kanter said. Referring to the arrangement of the chips, their "topology," he said, "You'll often see different communication algorithms that are used for different topologies and different parameters of how they are connected."
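One classic example of a topology-specific communication algorithm is ring all-reduce, in which each node only ever talks to its immediate neighbor. Below is a schematic single-process simulation, my own illustration rather than anything from MLPerf:

```python
import numpy as np

# Schematic simulation of ring all-reduce: nodes sit in a ring, each
# sends only to its right neighbor, and after 2*(N-1) exchange steps
# every node holds the full sum, with no central hub required.
def ring_allreduce(node_vectors):
    n = len(node_vectors)
    # Each node splits its vector into n chunks.
    chunks = [list(np.array_split(v.astype(float), n)) for v in node_vectors]

    # Phase 1, reduce-scatter: after n-1 neighbor exchanges, node i
    # holds the fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for dst, c, data in sends:
            chunks[dst][c] += data

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n,
                  chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for dst, c, data in sends:
            chunks[dst][c] = data

    return [np.concatenate(c) for c in chunks]

# Four nodes, each holding a different gradient vector.
vectors = [np.arange(8) * (i + 1) for i in range(4)]
results = ring_allreduce(vectors)
assert all(np.allclose(r, sum(vectors)) for r in results)
```

A different topology, such as a tree or a fully switched fabric, would favor a different algorithm, which is the point Kanter is making.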
The largest system in this training round, with 8,192 chips, was submitted by Nvidia, whose chips, as usual, turned in the fastest scores across all the benchmark tests. The Nvidia machine was built using its most common part currently in production, the H100 GPU, in combination with Intel CPU chips, 2,048 of them.
A more powerful system, however, made its debut: Nvidia's combined CPU-GPU part, the Grace-Blackwell 200. It was entered in a joint effort between IBM and the AI cloud-hosting giant CoreWeave, with each machine taking up an entire equipment rack, a configuration called the NVL72.
Also: The hidden data crisis threatening your AI transformation plans
The largest configuration submitted by CoreWeave and IBM comprised 2,496 Blackwell GPUs and 1,248 Grace CPUs. (While the GB200 NVL72 was submitted by IBM and CoreWeave, the design of the machine is Nvidia's.)
The benchmark drew a record 201 performance submissions from 20 submitting organizations, including Nvidia, Advanced Micro Devices, ASUSTeK, Cisco Systems, CoreWeave, Dell Technologies, Giga Computing, Google Cloud, Hewlett Packard Enterprise, IBM, Krai, Lambda, Lenovo, MangoBoost, Nebius, Oracle, Quanta Cloud Technology, SCITIX, Supermicro, and TinyCorp.
The latest round of the benchmark comprised seven individual tasks, including training the BERT large language model and training the Stable Diffusion image-generation model.
This round also saw the addition of a new test: how long it takes to fully train Meta Platforms' Llama 3.1 405B large language model. The fastest system, the 8,192-GPU Nvidia H100 machine, completed the task in just 21 minutes. The Grace-Blackwell system with 2,496 GPUs was not far behind, at a little more than 27 minutes.
Full results and machine specs can be seen on the MLCommons site.
Within those numbers, there is no precise measure of how large a role networking plays in the giant systems. The tests show improvement from one generation of MLPerf to the next on the same benchmark, even with the same number of chips.
For example, the best time to train Stable Diffusion using 64 chips at a time fell from 10 minutes to three versus the prior round. Because the chips themselves also improve from round to round, it is difficult to say how much of that drop is due to better networking and systems engineering.
Also: OpenAI wants ChatGPT to be your 'super assistant' - what that means for you
Instead, participants in MLPerf pointed to several factors that can produce meaningful differences in performance.
"Interconnect scalability is more and more important because you have to scale the size of the network," said Rachata Ausavarungnirun of MangoBoost, maker of SmartNIC technology and software, in the same media briefing. MangoBoost submitted machines assembled with eight, 16, and 32 of Advanced Micro Devices' MI300X GPUs, which compete with Nvidia's chips.
The element of interconnect scalability, said Ausavarungnirun, means "not only how fast the compute will take, or the memory, but how much the network becomes a bottleneck and has to be accelerated. It becomes more and more important as you grow" the number of chips.
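A back-of-the-envelope model shows why (the numbers below are my own illustrative assumptions, not MangoBoost's): the compute per chip shrinks as chips are added, while the per-step synchronization cost does not, so communication's share of each training step grows.

```python
# Toy model of the network bottleneck. Assumed, illustrative numbers:
# compute splits evenly across N chips, while per-step communication
# (e.g., the gradient all-reduce) stays roughly constant per chip.
def step_time(n_chips, compute_secs=100.0, comm_secs=0.5):
    compute = compute_secs / n_chips   # parallelizable work shrinks with N
    comm = comm_secs                   # synchronization cost does not
    return compute + comm

for n in (32, 256, 2048, 8192):
    t = step_time(n)
    print(f"{n:>5} chips: {t:6.3f}s/step, "
          f"{100 * 0.5 / t:4.1f}% of time spent communicating")
```

With these assumed numbers, communication goes from about 14% of each step at 32 chips to nearly 98% at 8,192, which is exactly the bottleneck effect Ausavarungnirun describes.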
Different networking technologies, such as Ethernet, and individual networking protocols, such as TCP/IP, "have different characteristics in this context in terms of how effective the throughput really is for these different [AI] models," said Chetan Kapoor of CoreWeave, which submitted the Nvidia NVL72, in the same media briefing.
Also: 30% of Americans are now active AI users, says new Comscore data
Such differences in throughput "map directly to system utilization," he said, meaning how efficiently a system uses its chips to perform those linear algebra operations.
"I think that's also an area where the industry is making a lot of progress, continuing to push forward the boundaries of effective network utilization," Kapoor said.
Nvidia's achievement comes from the "unprecedented scaling efficiency" running inside its machines, said Dave Salvator, director of accelerated computing products at Nvidia, in a separate media briefing.
Salvator noted that the 2,496-GPU Grace-Blackwell NVL72 system was able to achieve 90% scaling efficiency, meaning that the machine's performance improves in almost direct proportion to how many chips are connected together.
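Scaling efficiency is conventionally the achieved speedup divided by the ideal, linear speedup. A minimal sketch, with hypothetical baseline numbers chosen only so the arithmetic lands near the 90% figure from the briefing:

```python
# Scaling efficiency: achieved speedup over a baseline, divided by the
# ideal (linear) speedup. The chip counts and times below are made up
# for illustration; only the ~90% result reflects the reported figure.
def scaling_efficiency(base_chips, base_time, big_chips, big_time):
    achieved = base_time / big_time          # how much faster it actually got
    ideal = big_chips / base_chips           # what linear scaling predicts
    return achieved / ideal

# Hypothetically: 312 chips take 200 minutes; 2,496 chips take 27.8 minutes.
eff = scaling_efficiency(312, 200.0, 2496, 27.8)
print(f"{eff:.0%} scaling efficiency")       # ~90%
```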
To reach that level of efficiency, Nvidia made extensive use of its NVLink communication technology that connects the chips, Salvator said. "It's also our ability to do things like use our collective communication libraries, NCCL, and overlap computation and communication, that really lets us achieve that best scaling efficiency," Salvator said.
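Here is a sketch of the overlap technique, using the standard pattern from PyTorch's distributed API rather than Nvidia's internal code; it assumes a process group has already been initialized (for example, with the NCCL backend on GPUs):

```python
import torch
import torch.distributed as dist

# Sketch of compute/communication overlap (the generic pattern, not
# Nvidia's implementation). As each layer's gradient is produced during
# the backward pass, an asynchronous all-reduce is launched for it, so
# earlier layers keep computing while finished gradients are already
# traveling over the interconnect.
# Assumes dist.init_process_group has been called (e.g., backend="nccl").

pending = []

def overlap_hook(grad):
    # async_op=True returns a handle immediately instead of blocking,
    # so backward computation proceeds while this gradient is summed
    # across all workers in the background.
    work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    pending.append((work, grad))
    return grad

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 8))
for p in model.parameters():
    p.register_hook(overlap_hook)

loss = model(torch.randn(16, 64)).sum()
loss.backward()                      # hooks fire layer by layer

for work, grad in pending:           # wait only once, before the optimizer step
    work.wait()
    grad /= dist.get_world_size()    # turn the summed gradient into an average
```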
Also: Salesforce's 5-level framework for AI agents finally cuts through the hype
Although the role of networking is difficult to isolate from the round-to-round improvement with the same number of chips, the results reinforce the continuing value of larger and larger systems, which has been an article of faith in the AI field. Increasing the number of chips dramatically reduces training time.
Kanter showed a graph comparing the improvement in test times since round 0.5. The speed-up is faster than the individual improvements of any single computer chip, he said, precisely because building the machines is a whole-system problem that includes things like network efficiency.
"You can see that through a combination of silicon architecture, algorithms, scale, everything, we are beating Moore's Law," Kanter said, referring to the decades-old semiconductor industry rule of thumb for progress in transistors. The speed-up is especially the case, he said, "at some of the most pressing workloads of the day, things like generative AI."
"It's actually setting a very high bar," said Kanter.