Accelerated deployment of large models on the end side
Deploying a large model on the end side typically involves several stages. The first is model training, in which a large amount of annotated data is used to train the model files. Already at this stage, the model's size and computational complexity must be weighed against the hardware conditions of the end device (a rough budget check is sketched below).
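A minimal sketch of such a check, assuming PyTorch; the toy model and the 4-bytes-per-parameter (FP32) figure are illustrative assumptions, not a real deployment target:

import torch.nn as nn

def model_footprint_mb(model: nn.Module, bytes_per_param: int = 4) -> float:
    """Estimate model size from the parameter count.

    bytes_per_param is 4 for FP32 weights; an INT8-quantized deployment
    would use roughly 1 byte per parameter instead.
    """
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / (1024 ** 2)

# Hypothetical toy model, used only for illustration.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
print(f"approx FP32 size: {model_footprint_mb(model):.1f} MB")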
Next comes model compression, which is usually necessary to reduce the storage and compute pressure the model places on the end device. Pruning, quantization, and similar techniques shrink the model and lower its computational complexity.
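As a concrete example, the sketch below applies PyTorch's built-in magnitude pruning followed by dynamic INT8 quantization. The model is a hypothetical stand-in; a real pipeline would also re-validate accuracy after each step.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Prune 30% of the smallest-magnitude weights in each Linear layer,
# then make the pruning permanent by removing the reparameterization.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Dynamic quantization: weights stored as INT8, activations quantized
# on the fly at inference time; cuts Linear-layer weight size roughly 4x.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)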
Then comes model deployment, during which the compressed model is moved onto the end device. This includes transferring the model files to the device and installing the necessary inference engine and runtime environment on it.
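A common packaging step at this point is exporting the model to a portable format that the on-device inference engine can load, for example ONNX. A minimal sketch under that assumption; the model, file name, and shapes are placeholders:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
model.eval()

# An example input fixes the traced graph's shapes.
dummy_input = torch.randn(1, 768)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)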
Finally, once deployment is complete, the end device can run inference with the model. This usually involves loading the model, preprocessing the input data, executing the model, and outputting the results, as in the sketch below.
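Continuing the hypothetical ONNX example above, an on-device inference loop with ONNX Runtime might look like this; the preprocessing is a placeholder for whatever the real model expects:

import numpy as np
import onnxruntime as ort

# Load the deployed model with the CPU execution provider.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def preprocess(raw: np.ndarray) -> np.ndarray:
    # Placeholder preprocessing: cast to FP32 and normalize.
    x = raw.astype(np.float32)
    return (x - x.mean()) / (x.std() + 1e-6)

raw_input = np.random.rand(1, 768)  # stands in for real sensor or user data
outputs = session.run(None, {"input": preprocess(raw_input)})
print("output shape:", outputs[0].shape)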
When deploying large models on the end side, several technical challenges and limitations must be considered. For example, the hardware of end-side devices is usually far weaker than that of cloud servers, so these constraints need to be factored in during the model design and compression stages. In addition, the network bandwidth and latency of end-side devices can affect the real-time performance and accuracy of model inference.
To overcome these challenges and limitations, a number of tools and platforms have been developed, such as MLflow, Ray Serve, Kubeflow, Seldon Core, BentoML, and ONNX Runtime. These tools help users build, deploy, and manage machine learning models more conveniently, improving the models' performance and availability on end-side devices.
Nowadays, the deployment of large models on the end side is accelerating. In the PC field, following Intel's launch of the first AI PC processor, manufacturers such as Lenovo Group, HP, and Acer have successively released new AI PC products. According to reports, more than ten laptop models can already run AI models locally, with a batch of new products to be launched one after another.
In the mobile phone industry, starting from the second half of 2023, handset makers such as Xiaomi, OPPO, and vivo began adding large-model capabilities to their new systems. By January 2024, every vendor among the top five in the Chinese smartphone market except Apple had released its own end-side large model product.
The advantages of deploying large models on the end side are becoming increasingly prominent. On the one hand, end-side deployment reduces data transmission latency and bandwidth constraints, improving real-time performance and response speed. On the other hand, it better protects user privacy and data security, since data can be processed locally without being transmitted to the cloud.
Foreign manufacturers launch chips that support end-side deployment of large models
The deployment of large models on the end side cannot happen without chip support. Intel, Qualcomm, MediaTek, and others have launched chips designed for running large models on devices such as PCs and smartphones. Intel launched the first generation of its Core Ultra series processors, code-named Meteor Lake, built on the Intel 4 process. For the first time in a client CPU, this processor adopts a chiplet design together with Intel's advanced Foveros packaging technology. It integrates an NPU (neural network processing unit), can run a 20-billion-parameter large model locally, and can generate high-quality multimodal data in seconds without a network connection.
The third-generation Snapdragon 8 mobile platform released by Qualcomm is its first mobile platform designed specifically for generative AI. It supports running a 10-billion-parameter model on the end side and generates up to 20 tokens per second with a 7-billion-parameter large language model. It can also generate images on the device through Stable Diffusion.
In addition, Qualcomm has launched AI Hub, an AI model library for developers that includes both traditional and generative AI models and supports deployment on Snapdragon and other Qualcomm platforms. The library covers over 75 AI models, such as Whisper, ControlNet, Stable Diffusion, and Baichuan-7B, which developers can easily access and integrate into their applications.
MediaTek has cooperated deeply with Alibaba Cloud to achieve end-side deployment of the Tongyi Qianwen model on the Dimensity 9300 and Dimensity 8300 mobile platforms. These Dimensity-series chips are high-performance, energy-efficient mobile computing platforms: beyond their strong processing capabilities, they support advanced 5G and generative AI technology, providing a solid foundation for end-side deployment of large models.
In addition, companies such as Aixin Yuanzhi and Xindongli Technology have optimized their products for end-side deployment of large models. Aixin Yuanzhi's AX650N chip has shown significant advantages in this area.
Specifically, the AX650N maintains high accuracy and efficiency when deploying large vision models such as Swin Transformer. Because most end-side AI chips lack architectural optimization for the MHA (multi-head attention) structure, deploying large models on them often requires modifying the network structure, which can reduce accuracy and force retraining. The AX650N, thanks to its architecture and optimizations, can directly run the original Swin Transformer: going from receiving a test board to reproducing the demo reportedly takes only 5 minutes, and getting a private model running in a private environment takes as little as 1 hour. For reference, the MHA computation in question is sketched below.
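This is the generic, textbook multi-head attention formulation (in PyTorch, with Swin-like illustrative shapes), not AX650N-specific code; it shows the structure that, per the above, most end-side chips cannot execute without network modifications:

import torch
import torch.nn.functional as F

def multi_head_attention(x, w_qkv, w_out, num_heads):
    B, N, C = x.shape                      # batch, tokens, channels
    qkv = x @ w_qkv                        # project to queries/keys/values
    q, k, v = qkv.chunk(3, dim=-1)
    # Split channels across heads: (B, heads, N, C // heads)
    split = lambda t: t.view(B, N, num_heads, C // num_heads).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention per head, then merge heads back.
    attn = F.softmax(q @ k.transpose(-2, -1) / (C // num_heads) ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, C)
    return out @ w_out

x = torch.randn(2, 49, 96)                 # a Swin-like 7x7 window of tokens
out = multi_head_attention(x, torch.randn(96, 288), torch.randn(96, 96), 3)
print(out.shape)                           # torch.Size([2, 49, 96])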
The AX650N also offers 32-channel video decoding and structured video processing, passive cooling, low-latency encoding and decoding, HDMI output, USB 3.0, and other features, making it well suited to a wide range of visual perception and edge computing scenarios. For end-side large model deployment, the AX650N provides not only strong computing power but also easy deployment and low power consumption, opening up more possibilities for practical applications.
Xindongli Technology, an AI chip startup founded by a team from Tsinghua University, has launched the AzureBlade L series M.2 accelerator card for large models. The card delivers strong performance, can smoothly run large model systems, and measures only 80 mm (length) x 22 mm (width), making it well suited to PCs and other end-side devices.
The AzureBlade L series M.2 accelerator card has been adapted to models such as Llama 2 and Stable Diffusion, becoming an accelerator for deploying large models on end-side devices. With its small size, strong performance, and universal interface, this M.2 card can break through the limited computing and storage capacity of end devices, clearing the way for large models to land on the end side.
Final thoughts
Deploying a large model on the end side is a complex process that involves many factors and technical challenges. However, through sound model design, compression, and optimization, together with suitable tools and platforms, end-side devices can gain stronger artificial intelligence capabilities. With efforts across the industry chain, end-side deployment of large models is now accelerating. As the technology continues to advance and improve, applications of large models deployed on the end side are expected to become increasingly widespread.