Nvidia dominates the AI accelerator market
Some analysts believe that Nvidia's dominance in AI accelerators now surpasses Intel's former dominance in PC processors, allowing Nvidia to reap the enormous industrial dividends of the AI era. According to Nvidia's financial report for the first quarter of fiscal year 2025, revenue reached $26 billion, up 262% year-on-year, while net profit reached $14.81 billion, up 628% year-on-year.
So why has Nvidia been able to demonstrate such strong dominance in AI accelerators? The author believes it mainly stems from three factors: core chips, the software ecosystem, and interconnect technology.
In terms of core chips, the story can be traced back to 2020. At GTC 2020, Nvidia launched a new-generation GPU based on the Ampere architecture: the NVIDIA A100. As a general-purpose workload accelerator, the chip became one that AI giants at home and abroad rushed to buy, and it was later even banned from export to the Chinese market. The A100 showcased the "brute-force aesthetics" of AI accelerator development: according to Nvidia's own data, its performance leapt as much as 20-fold over the previous generation. At GTC 2024, Nvidia followed up with the B100, built on the architecture code-named Blackwell and carrying 192 GB of memory. In AI accelerator chips, Nvidia offers not just one strong product but a strong product matrix. On the architecture side it has successively launched the Volta, Turing, Ampere, Hopper, and Blackwell architectures, yielding a broad combination of chip products such as the B100, H200, L40S, A100, A800, H100, H800, and V100.
Meanwhile, Nvidia backs these products with powerful software. Whether the workload is general-purpose acceleration or specialized compute, NVIDIA's CUDA ecosystem provides ample support. Since launching CUDA in 2006, Nvidia has built up a huge user base around the parallel computing capabilities of CUDA and its GPUs. The CUDA platform spans hardware architecture and programming models, giving developers a more direct and efficient way to use GPUs for parallel computing. According to figures Nvidia disclosed at COMPUTEX 2023, CUDA now has more than 4 million developers and over 3,000 applications, with a staggering 40 million cumulative downloads. What makes CUDA even more popular today is its momentum: in 2022 alone, CUDA downloads reached an astonishing 25 million, and the figure is still growing rapidly.
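Part of CUDA's pull on developers is how simple its programming model is: every thread computes a global index from its block and thread coordinates and processes one element. As a rough illustration only (plain Python, not CUDA code; the function name and block size are invented for the sketch), the same decomposition looks like this:

```python
# Illustrative sketch of the CUDA-style SPMD decomposition in plain Python.
# Each "thread" derives a global index from its block and thread coordinates
# and processes exactly one array element. In real CUDA C++, the two loops
# below are implicit in the kernel launch and run in parallel on the GPU.

def vector_add(a, b, block_dim=4):
    n = len(a)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim      # ceil-divide, as in a kernel launch
    for block_idx in range(grid_dim):                # blocks (run in parallel on a GPU)
        for thread_idx in range(block_dim):          # threads within a block
            i = block_idx * block_dim + thread_idx   # blockIdx * blockDim + threadIdx
            if i < n:                                # bounds check, a standard CUDA idiom
                out[i] = a[i] + b[i]
    return out
```

The bounds check matters because the grid is rounded up to whole blocks, so the last block may have threads with no element to process, exactly as in a real CUDA kernel.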
Nvidia's third major advantage in AI accelerators is interconnect technology. AI computing clusters are critical infrastructure for the development of artificial intelligence: today's hottest large AI models cannot run on a single card and instead require a powerful computing cluster. In a cluster built on Nvidia compute cards, the main interconnect technologies are NVLink and InfiniBand, which handle tightly coupled short-range parallel computing and cluster scale-out, respectively. NVLink lets GPUs access each other's memory directly, without CPU intervention. NVLink has now reached its fifth generation, which significantly improves the scalability of large multi-GPU systems. A single NVIDIA Blackwell Tensor Core GPU supports up to 18 NVLink connections at 100 GB/s each, for a total bandwidth of 1.8 TB/s: twice the bandwidth of the previous generation and up to 14 times that of PCIe 5.0. Server platforms such as the GB200 NVL72, a 72-GPU NVLink domain, use this technology to deliver greater scalability for today's exceptionally complex large models.
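These headline figures hang together under simple arithmetic. A quick sanity check, noting that the PCIe 5.0 x16 figure below (roughly 128 GB/s bidirectional) and the previous-generation per-GPU total (900 GB/s for Hopper) are my own assumptions, implied by but not stated in the text:

```python
# Sanity-checking the NVLink 5 numbers quoted above.
links = 18                # NVLink connections per Blackwell GPU (quoted)
per_link_gb_s = 100       # GB/s per NVLink 5 connection (quoted)
total_gb_s = links * per_link_gb_s
print(total_gb_s)         # 1800 GB/s, i.e. the quoted 1.8 TB/s total

prev_gen_gb_s = 900       # assumed: NVLink 4 (Hopper) per-GPU total
print(total_gb_s / prev_gen_gb_s)          # "twice the previous generation"

pcie5_x16_gb_s = 128      # assumed: PCIe 5.0 x16, ~64 GB/s each direction
print(round(total_gb_s / pcie5_x16_gb_s))  # "up to 14x PCIe 5.0"
```

Under those assumptions, the 2x and 14x claims both check out.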
Meanwhile, Nvidia also has NVSwitch. The NVSwitch chip is a physical chip, similar to a switch ASIC, that connects multiple GPUs at high speed through NVLink interfaces, improving communication efficiency and bandwidth among the GPUs inside a server. For example, the NVIDIA A100 Tensor Core GPU introduced third-generation NVLink and second-generation NVSwitch, doubling both the per-GPU bandwidth and the all-reduce bandwidth. With fourth-generation NVLink and third-generation NVSwitch, a system of eight NVIDIA H100 Tensor Core GPUs can be linked with 3.6 TB/s of bisection bandwidth and 450 GB/s of all-reduce bandwidth, increases of 1.5x and 3x, respectively, over the previous generation.
In summary, Nvidia's layout in AI accelerators is remarkably comprehensive, and this systematic solution is currently the best approach to AI acceleration, bar none. By one statistical estimate, Nvidia holds more than 90% of the AI accelerator chip market and is called the "undisputed leader" in artificial intelligence. Of course, this also means that apart from Nvidia's AI accelerator chips, it is hard for other manufacturers' chips to find market openings, even chips launched by international giants. One important reason is that the system Nvidia has built around its AI accelerators is not only powerful but also closed, with poor compatibility with non-Nvidia chips. This has become known as the "Nvidia path" of AI chip development. What leaves other manufacturers despairing is that if they follow this path, their products cannot meaningfully threaten Nvidia's chips unless they target certain niche purposes.
Therefore, technology giants such as Intel and Google now hope to start from the interconnect, tearing an opening in the Nvidia AI accelerator ecosystem to win more market share.
UALink is ambitious but also has hidden concerns
Beyond Intel, Google, Microsoft, and Meta, the UALink alliance's members include AMD, Hewlett Packard Enterprise, Broadcom, and Cisco. Notably, Arm, a major supplier of core IP, has not yet joined. The alliance's main responsibility is to oversee the future development of the UALink standards.
The UALink Alliance argues that an open industry standard such as UALink is crucial for standardizing next-generation AI data centers and the interfaces used by AI, machine-learning, HPC, and cloud applications. The group will develop a specification defining a high-speed, low-latency interconnect for scale-up communication between accelerators and switches in AI computing pods.
The first version proposed by the alliance, UALink 1.0, connects up to 1,024 AI accelerators and is based on open standards, including AMD's Infinity Fabric. Infinity Fabric adopts a distributed architecture with multiple independent channels, each capable of bidirectional data transfer; this design enables fast, low-latency communication between different cores and thus improves overall performance. Infinity Fabric is divided into the Scalable Control Fabric (SCF) and the Scalable Data Fabric (SDF): the SDF handles data transport, while the SCF carries control and command traffic.
From the perspective of technological evolution, if Infinity Fabric becomes a major component of the UALink specification, users may worry about UALink 1.0's ultimate interconnect efficiency. Reportedly, the SDF portion of Infinity Fabric is essentially a derivative of the HyperTransport (HT) bus, which was originally designed for serial CPU-to-CPU connections, whereas UALink 1.0 targets GPGPU workloads, and the two differ enormously in their demands for parallel data movement. Some industry insiders therefore believe the UALink 1.0 specification will not enter the market at scale but will instead lay the groundwork for the UALink framework; it will be hard for it to pose a serious challenge to the NVLink + NVSwitch system.
Of course, Broadcom and Cisco will actively improve UALink 1.0 and its successors. Broadcom may deliver an early Ultra Ethernet NIC in its 800 Gbps Thor product line, while Cisco is expected to develop products positioned against NVSwitch. Moreover, each of the giants that has joined has its own agenda: Google has custom chips, the TPU and Axion, for training and running AI models; Microsoft's new Maia 100 chip has been tested on Bing and Office AI products and is eager to enter the market; and Meta previously announced the latest version of its self-developed MTIA, a custom chip series designed specifically for AI training and inference. Some therefore argue that if UALink succeeds, these companies with self-developed chips will be its biggest beneficiaries. Yet because their chip teams understand chips, architectures, and clusters differently, those differences may also drag down UALink's development, leaving the project with plenty of innovative forces but insufficient innovation efficiency.
Epilogue
Reportedly, UALink 1.0 will become available to companies joining the alliance at the same time, and a higher-bandwidth update, UALink 1.1, will be launched in the fourth quarter of 2024. Since Nvidia is not in the alliance and has no need to participate, UALink 1.1 may, upon launch, be positioned directly against a particular generation of NVLink. For now, however, before UALink can truly stand against NVLink, companies including Microsoft and Meta are still scrambling to buy Nvidia's AI accelerator chips.