Apple chose Google chips for large-model training
In a recent technical paper, Apple gave a detailed account of its large models on both the device and server sides. For pre-training, the Apple foundation models are trained with the AXLearn framework, an open-source project Apple released in 2023. Built on JAX and XLA, the framework allows models to be trained efficiently and scalably across a variety of hardware and cloud platforms, including TPUs as well as cloud and on-premises GPUs.
Apple uses a combination of data parallelism, tensor parallelism, sequence parallelism, and fully sharded data parallelism (FSDP) to scale training along multiple dimensions, such as data size, model size, and sequence length.
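The core idea behind FSDP can be sketched in plain Python: each worker keeps only its own shard of a parameter tensor, and the full tensor is all-gathered just before it is needed. The toy shapes and helper names below are illustrative, not Apple's actual configuration:

```python
# Minimal FSDP-style sharding sketch (illustrative, pure Python).
# Each "device" holds only a 1/N slice of the parameters; before a
# layer runs, the slices are all-gathered into the full tensor.

def shard(params, n_devices):
    """Split a flat parameter list into n_devices contiguous shards."""
    k = len(params) // n_devices
    return [params[i * k:(i + 1) * k] for i in range(n_devices)]

def all_gather(shards):
    """Reassemble the full parameter list from every device's shard."""
    full = []
    for s in shards:
        full.extend(s)
    return full

params = list(range(8))          # a toy 8-element "weight tensor"
shards = shard(params, 4)        # each of 4 devices stores 2 elements
assert all(len(s) == 2 for s in shards)
assert all_gather(shards) == params  # gather recovers the full tensor
```

In a real system the gather happens per layer and the gathered copy is discarded after use, which is what lets model size scale with the number of devices.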
Among them, the server-side AFM model is Apple's largest language model. It was pre-trained on 8,192 TPUv4 chips arranged as 8 slices of 1,024 chips each, connected through a data-center network (DCN). Pre-training proceeded in three stages: a core stage of 6.3 trillion tokens, a continued stage of 1 trillion tokens, and finally a context-lengthening stage of 100 billion tokens.
Apple trimmed the on-device AFM model significantly. The paper reveals that it is a 3-billion-parameter model distilled from a 6.4-billion-parameter server model, which in turn was trained on the full 6.3 trillion tokens.
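Knowledge distillation of this kind can be sketched as follows; the temperature, toy logits, and loss form here are generic illustrations of the technique, not the specifics of Apple's recipe:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student): the student is pushed toward the
    teacher's softened output distribution."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]   # toy logits from the larger model
student = [1.5, 1.2, 0.3]   # toy logits from the smaller model
assert distill_loss(student, teacher) > 0.0   # imperfect match costs loss
assert distill_loss(teacher, teacher) == 0.0  # identical outputs: zero loss
```

Raising the temperature softens the teacher's distribution, exposing the relative probabilities of wrong answers, which is much of the signal a small student learns from.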
Unlike the server-side model, the on-device AFM model was trained on Google's TPUv5p chips: according to the paper, it used a cluster of 2,048 TPUv5p chips.
Google released TPUv5p in December 2023, targeting cloud AI acceleration and billing it as "the most powerful, scalable, and flexible artificial intelligence accelerator to date."
TPUv5p delivers 459 teraFLOPS (459 trillion floating-point operations per second) at bfloat16 precision and 918 teraOPS (918 trillion integer operations per second) at Int8 precision, and it carries 95GB of HBM with memory bandwidth of up to 2.76 TB/s.
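As a back-of-envelope check on these figures (an illustrative calculation, not from Google's materials), dividing peak compute by memory bandwidth gives the arithmetic intensity a workload needs to stay compute-bound on TPUv5p:

```python
# Back-of-envelope arithmetic from the published TPUv5p specs.
BF16_FLOPS = 459e12   # 459 teraFLOPS at bfloat16
INT8_OPS   = 918e12   # 918 teraOPS at Int8
HBM_BW     = 2.76e12  # 2.76 TB/s HBM bandwidth

# Int8 throughput is exactly double bfloat16 throughput.
assert INT8_OPS / BF16_FLOPS == 2.0

# Ridge point of a simple roofline model: a kernel needs at least
# this many bf16 FLOPs per byte of HBM traffic to be compute-bound.
ridge = BF16_FLOPS / HBM_BW
print(f"{ridge:.0f} FLOPs/byte")  # prints "166 FLOPs/byte"
```

A ridge point this high is typical of modern accelerators: large matrix multiplications clear it easily, while bandwidth-bound steps such as autoregressive decoding generally do not.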
Compared with the previous-generation TPU v4, TPUv5p doubles the floating-point operations per second, triples the memory bandwidth, trains large models 2.8 times faster, and is 2.1 times more cost-effective.
Beyond Apple, companies currently training large models on Google's TPU chips include Google itself, with Gemini and PaLM, and Anthropic, founded by a former vice president of OpenAI, with its Claude models. (Meta's Llama 3.1 405B, released last month and widely regarded as the strongest open-source large model, was by contrast trained on Nvidia GPUs.)
The examples of Apple, Google, and Anthropic demonstrate that TPUs can train large models. Compared with Nvidia, however, TPU adoption remains the tip of the iceberg: most other large-model players, including OpenAI, Tesla, ByteDance, and other giants, continue to rely on Nvidia GPUs, which are still the mainstay of major AI data centers.
Nvidia's challengers
The software ecosystem built around CUDA has always been Nvidia's widest moat in the GPU field, and as AI development has accelerated and the market has boomed, the GPU-plus-CUDA ecosystem has only grown more entrenched. Although AMD, Intel, and other vendors are striving to catch up, none has yet shown it can threaten Nvidia's position.
But a booming market inevitably attracts more players to enter the game and challenge Nvidia, or at least to claim a share of the vast AI market.
First, Nvidia's biggest rival in the GPU field is AMD. In January of this year, researchers used roughly 8% of the GPUs in the Frontier supercomputer to train a GPT-3.5-scale large model. Frontier is built entirely on AMD hardware, with 37,888 MI250X GPUs and 9,472 EPYC 7A53 CPUs. The study also worked through the difficulties of advanced distributed training on AMD hardware, demonstrating the feasibility of training large models on the AMD platform.
At the same time, the CUDA ecosystem is gradually being pried open. In July of this year, the British company Spectral Compute launched a toolchain that compiles CUDA source code natively for AMD GPUs, greatly improving how efficiently AMD GPUs can run CUDA code.
Intel's Gaudi 3 was likewise benchmarked directly against Nvidia's H100 at launch, with Intel claiming 40% faster model training and 50% faster inference than the H100.
Beyond the chip giants, startups are mounting challenges of their own: Groq's LPU, Cerebras' Wafer Scale Engine 3, and Etched's Sohu, among others. In China, some startups have taken the multi-card cluster route. In June of this year, Moore Threads announced that, in cooperation with Yuren Technology, it had achieved full training compatibility between its KUAE thousand-card intelligent computing cluster and the YuRen model stack, efficiently completing training and testing of the 7-billion-parameter large language model YuRen-7b.
The Moore Threads KUAE solution is based on the full-featured MTT S4000 GPU, which uses the third-generation MUSA core and offers 48GB of memory and 768GB/s of memory bandwidth per card, with FP16 compute of 100 TFLOPS. Notably, with Moore Threads' self-developed migration tooling, the MTT S4000 can plug into the existing CUDA software ecosystem and migrate CUDA code to the MUSA platform at zero cost.
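From the published per-card figures, the aggregate capacity of a thousand-card KUAE cluster can be estimated with simple arithmetic. This is an illustrative linear scaling, not a vendor figure; deliverable performance depends on interconnect and utilization:

```python
# Illustrative aggregate capacity of a 1,000-card MTT S4000 cluster,
# scaled linearly from the published per-card specs.
CARDS       = 1000
FP16_TFLOPS = 100   # per-card FP16 compute
MEM_GB      = 48    # per-card memory
MEM_BW_GBPS = 768   # per-card memory bandwidth

total_pflops = CARDS * FP16_TFLOPS / 1000  # peak FP16 in petaFLOPS
total_mem_tb = CARDS * MEM_GB / 1000       # total memory in TB

print(total_pflops, "PFLOPS peak FP16")  # prints 100.0 PFLOPS peak FP16
print(total_mem_tb, "TB total memory")   # prints 48.0 TB total memory
```

At rough FP16 parity assumptions, 48 TB of aggregate memory is ample for a 7-billion-parameter model with optimizer state, which is consistent with YuRen-7b being trainable on such a cluster.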
Iluvatar CoreX (Tianshu Zhixin) has likewise partnered with the Beijing Academy of Artificial Intelligence (BAAI) and Aite Yunxiang, supplying Tiangai 100 accelerator cards, building compute clusters, and providing full technical support to realize a large-model CodeGen project on an independently developed general-purpose GPU: the model generates usable C, Java, and Python code from Chinese-language descriptions to support efficient coding.
It is also worth noting another Chinese AI chip company on the TPU route: Zhonghao Xinying. At the end of 2023, the company launched "Instant," billed as China's first mass-produced TPU AI training chip. Compared with Nvidia's A100 on large-model training and inference tasks, it is claimed to deliver nearly 150% higher performance and 30% lower energy consumption, at a unit compute cost of only 42% of the A100's.
Of course, beyond the chip companies, publicly available information shows that the mainstream cloud service providers, including Google, Amazon, Microsoft, Meta, Alibaba, ByteDance, Baidu, and Huawei, all have chip programs of their own, including chips for training large AI models.
Final thoughts
In the long run, self-developed chips are one effective way for cloud providers to cut compute costs. As AI model training becomes a major use of cloud computing, self-developed AI training chips are naturally a long-term play for them as well. Apple, a consumer electronics giant, has taken an important step away from dependence on Nvidia's compute, and many more challengers are waiting in the wings. A single spark can start a prairie fire, and Nvidia's position in AI training may not be as solid as it looks.