Huatai Computer: sizing the global AI computing market from the evolution of large models

We believe that, judging from the evolution path of large models, model sizes will expand further, driving continued growth in demand for computing power. In the long run, the mature operation of large models is expected to create an incremental server market of $316.9 billion, which leaves substantial room for growth compared with the global AI server market of $21.1 billion in 2023. Accordingly, we believe the continued iteration of large models will generate substantial demand for computing infrastructure, and we recommend paying attention to investment opportunities in the computing industry.

Core view

Global demand for AI computing power continues to rise

As large models continue to iterate, their capabilities keep strengthening, the result of ever-larger parameter counts and data sets under the "scaling law". We believe that, judging from the evolution path of large models, model sizes will expand further, driving continued growth in demand for computing power. Specifically, large models' demand for computing power arises in three stages: pre-training, inference and fine-tuning. By our calculation, for a 100-billion-parameter model, the total computing power across the three stages is about 180,000 PFlop/s-day, equivalent to roughly 28,000 A100-class GPUs. In the long run, the mature operation of large models is expected to create an incremental server market of $316.9 billion, which leaves substantial room for growth compared with the global AI server market of $21.1 billion in 2023. Accordingly, we believe the continued iteration of large models will generate substantial demand for computing infrastructure, and we recommend paying attention to investment opportunities in the computing industry.

Model sizes keep growing, driving demand for computing power build-out.

A large language model (LLM) is a model pre-trained on massive data sets, and LLMs show great potential across a wide range of NLP tasks. The emergence of the Transformer architecture opened the way for the evolution of large models. As the number of stacked decoder blocks grows, model parameters keep increasing, and models have gradually evolved into successive generations such as GPT-1, GPT-2, GPT-3, PaLM and Gemini, with parameter counts rising from the hundreds of millions to the billions, hundreds of billions and even trillions. Each generation of models has brought stronger capabilities, and a key reason behind this lies in the growth of parameters and data sets, which has continuously improved model perception, reasoning and memory. Based on the scaling law, we believe future model iterations will continue down the path of larger parameters and evolve more intelligent multi-modal capabilities.

The computing power requirements of large models arise in three stages: pre-training, inference and fine-tuning.

Broken down, the computing power demand of large models arises mainly from pre-training, fine-tuning and daily operation. Our calculation approach for the three parts is as follows: 1) pre-training: under the "Chinchilla scaling law" assumption, the compute required can be described by the formula C ≈ 6NBS; 2) inference: based on ChatGPT traffic, the compute required can be described by the formula C ≈ 2NBS; 3) fine-tuning: back-calculated from the GPU core-hours needed per tuning run. Taking a 100-billion-parameter model as an example, the three stages require about 13,889 PFlop/s-day for pre-training, 5,555.6 PFlop/s at peak for inference, and 216 PFlop/s-day per month for fine-tuning. We believe that, under the scaling law, demand for computing power is expected to be released continuously as model sizes grow.

Infrastructure demand is expected to keep being released; focus on investment opportunities in the computing industry.

Combining the computing power calculations for pre-training, inference and fine-tuning of large models, we estimate that taking a 100-billion-parameter model from development to mature operation requires 28,000 A100-equivalent GPUs. By our calculation, the mature operation of large models is expected to create an incremental global AI server market of $316.9 billion. By comparison, according to IDC, the global AI server market was $21.1 billion in 2023 and is expected to grow at a 22.7% CAGR over 2024-2025, leaving much room for growth. In addition, given China's limited access to high-performance chips, the localization of AI GPUs is also expected to accelerate further.

Risk warning: macroeconomic fluctuations; downstream demand falling short of expectations; possible bias in our calculation results.

Main body

"Scaling Law" drives the demand for large-scale model computing power to grow continuously.

The emergence of the Transformer opened the way for the evolution of large models. A large language model (LLM) is a model pre-trained on massive data sets without task-specific adjustment of the data. It shows great potential across NLP (natural language processing) tasks such as natural language understanding (NLU) and natural language generation. Judging from the development of LLMs in recent years, their technical routes fall into three main types: 1) encoder-only; 2) encoder-decoder; 3) decoder-only. Several development characteristics stand out: 1) the decoder-only route is dominant, attributable to the excellent performance of the GPT-3 model in 2020; 2) the GPT series has stayed ahead, likely owing to OpenAI's persistence with its decoder-only technology; 3) closed-source models have gradually become the trend among leading players, a practice that also began with GPT-3 and that companies such as Google have followed; 4) the encoder-decoder route continues to develop, but with fewer models than the decoder-only route, perhaps because its more complex structure offers no clear advantage in engineering practice.

Large models are likely to evolve toward larger parameter counts. From GPT-1 to GPT-4, and from PaLM to Gemini, each generation of models has grown stronger and scored better across benchmarks. As for the source of this capability, we believe parameters and data sets are the two most important variables. As parameter counts grow from the billions to the tens of billions, hundreds of billions and trillions, the increase resembles the growth in the number of human synapses, continuously improving model perception, reasoning and memory. The growth of data sets resembles the human process of learning knowledge, continuously strengthening the model's understanding of the real world. We therefore believe the next generation of models will continue down the route of larger parameters and evolve more intelligent multi-modal capabilities.

Broken down, the computing power demand scenarios of large models mainly comprise pre-training, fine-tuning and daily operation. Judging from the practical application of ChatGPT and starting from a training-plus-inference framework, the computing power requirements of a large model can be divided into three scenarios: 1) pre-training: training the model's basic language ability mainly on large amounts of unlabeled plain-text data, yielding a base model such as GPT-1/2/3; 2) fine-tuning: on top of the pre-trained model, performing further rounds of training such as supervised learning, reinforcement learning and transfer learning to optimize and adjust the model parameters; 3) daily operation: loading the model parameters to run inference on user input and return the final output.

Pre-training: demand for computing power is expected to keep growing under the scaling law

The pre-training performance of a large model is mainly determined by parameter count, token count and compute, and it satisfies the "scaling law". According to the paper Scaling Laws for Neural Language Models published by OpenAI in 2020, the parameter count, number of tokens and amount of compute each have a significant impact on the performance of large language models during training. For best performance, all three factors must be scaled up together. When not constrained by the other two factors, model performance has a power-law relationship with each individual factor, i.e., it satisfies the "scaling law".

OpenAI holds that the compute required for model pre-training can be described by the formula C ≈ 6NBS. According to Scaling Laws for Neural Language Models, the compute (C) required to pre-train a Transformer-architecture model is incurred in the forward pass (about 2N FLOPs per token) and the backward pass (about 4N FLOPs per token), and is determined by three variables: the model parameter count (N), the batch of tokens consumed per training step (B), and the number of training steps (S). The product of B and S is the total number of tokens consumed in pre-training. On this basis, the compute required to pre-train a large model can be described as C ≈ 6NBS.

Among the three, OpenAI believes the parameter count is the most important variable: the larger the parameters, the better the model. As more compute becomes available, model developers can choose how much of it to allocate to training larger models, using larger batches, and training for more steps. If the compute budget increases a billion-fold, most of the increase should go toward larger model size to achieve compute-optimal training. Only a relatively small increase in data is needed to avoid re-use, and most of that increase can be absorbed through larger batch sizes for greater parallelism, with only a small increase in serial training time.

Google proposed the "Chinchilla scaling law", which holds that model parameters and training data must be scaled up in equal proportion for best results. According to Training Compute-Optimal Large Language Models, published by Google DeepMind in 2022, the relationship between model performance and the number of tokens and parameters used in pre-training is not linear; rather, the best results are achieved when the parameter count and the number of training tokens reach a certain ratio. To verify this rule, DeepMind trained a 70-billion-parameter model ("Chinchilla") on 1.4 trillion tokens and found it outperformed Gopher, a 280-billion-parameter model trained on 300 billion tokens. Further research found that the optimal relationship between parameter count and data set size approximately satisfies D = 20P, where D is the number of tokens and P is the model parameter count; at this ratio the model satisfies the "Chinchilla scaling law".

We estimate that training a 100-billion-parameter model requires more than 10,000 PFlop/s-day of compute. We assume that models of all sizes satisfy the "Chinchilla scaling law", from which we derive the optimal data set size and the pre-training compute for each model. For a 100-billion-parameter large language model, the "Chinchilla scaling law" calls for 2 trillion training tokens. Using OpenAI's formula C ≈ 6NBS, the compute required to train the 100-billion-parameter model is about 1.39×10^4 PFlop/s-day. Similarly, training a 500-billion-parameter model requires about 3.47×10^5 PFlop/s-day, and training a 1-trillion-parameter model about 1.39×10^6 PFlop/s-day.
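The pre-training estimates above can be reproduced in a few lines, a minimal sketch assuming the Chinchilla-optimal token count D = 20P and the OpenAI formula C ≈ 6NBS:

```python
def pretrain_pflops_day(params: float) -> float:
    """Pre-training compute under C ≈ 6NBS, where B·S is the total token
    count, set to the Chinchilla-optimal D = 20P. Result in PFlop/s-day."""
    tokens = 20 * params               # "Chinchilla scaling law": D = 20P
    flops = 6 * params * tokens        # C ≈ 6·N·(B·S)
    return flops / (1e15 * 86400)      # 1 PFlop/s-day = 1e15 FLOPs/s for a day

for n, label in [(100e9, "100B"), (500e9, "500B"), (1e12, "1T")]:
    print(f"{label}: {pretrain_pflops_day(n):.2e} PFlop/s-day")
```

For 100B parameters this gives about 1.39×10^4 PFlop/s-day, matching the figure in the text; the 500B and 1T cases scale quadratically with parameter count because the data set grows in proportion to the model.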

Inference: high concurrency is the main driver of inference compute demand.

The underlying architecture of GPT models consists of decoder blocks. In a large language model such as GPT, the decoder block is the basic architectural unit, and the model's underlying architecture is built by stacking these blocks; the number of decoder blocks determines the scale of the model. GPT-1 has 12 blocks, GPT-2 has 48 and GPT-3 has 96. The more blocks, the greater the parameter count and the larger the model.

The decoder blocks perform large-model inference by computing over tokenized text. According to Scaling Laws for Neural Language Models, once a large model has been trained, the model itself is fixed, and it can serve inference once its parameters are loaded. In essence, inference traverses the parameters of the large model once more: the input text is encoded into vectors, passed through the attention-mechanism computations, and the result is decoded back into words. In this process, the model's parameter count depends on the number of layers, the width of the feed-forward layers, and the number of attention heads.

The compute required for inference can be described by the formula C ≈ 2NBS. Because the decoder blocks perform only the forward pass during inference, the main compute lies in text encoding, attention computation, text decoding and so on. According to the formula given by OpenAI, each input token requires roughly 2N FLOPs plus a context-dependent attention term; since the latter accounts for a small share of the total (context lengths are typically only in the thousands of tokens), it is usually ignored. The inference compute requirement of a large model is therefore the product of the per-token compute and the number of tokens processed, i.e., C ≈ 2NBS.

Assuming traffic equal to ChatGPT's, we expect inference for a 100-billion-parameter model to require more than 5,000 PFlop/s of computing power. According to Similarweb, the ChatGPT official website received 1.8 billion visits in March 2024. We assume 10 question-and-answer rounds per visit and 800 tokens consumed per round, implying that the ChatGPT official website consumes about 14.4 trillion tokens per month. Since computing infrastructure is built for peak rather than average demand, we further assume that peak token demand is five times the average. Finally, assuming models of different sizes receive the same traffic as ChatGPT, the formula C ≈ 2NBS gives peak inference requirements of 5,555.6, 27,777.8 and 55,555.6 PFlop/s for 100-billion, 500-billion and 1-trillion-parameter models, respectively.
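The peak inference figures follow mechanically from the traffic assumptions; a short sketch, with the defaults encoding the assumptions stated in the text (1.8B monthly visits, 10 Q&As per visit, 800 tokens each, a 5x peak-to-average ratio):

```python
def inference_peak_pflops(params: float,
                          monthly_visits: float = 1.8e9,
                          qna_per_visit: int = 10,
                          tokens_per_qna: int = 800,
                          peak_multiple: float = 5.0) -> float:
    """Peak inference compute in PFlop/s via C ≈ 2·N·(peak tokens per second)."""
    tokens_per_month = monthly_visits * qna_per_visit * tokens_per_qna  # ~14.4T
    avg_tokens_per_s = tokens_per_month / (30 * 86400)  # 30-day month
    peak_tokens_per_s = avg_tokens_per_s * peak_multiple
    return 2 * params * peak_tokens_per_s / 1e15

print(f"{inference_peak_pflops(100e9):.1f} PFlop/s")  # 100B-parameter model
```

With the default traffic, a 100-billion-parameter model needs about 5,555.6 PFlop/s at peak, consistent with the table of results above.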

Fine-tuning: computing power demand depends mainly on the number of tuning runs.

After pre-training, a large model's parameters still need to be fine-tuned to meet human needs. Generally, large language models require continual fine-tuning after pre-training to achieve better results. Taking OpenAI as an example, model tuning uses reinforcement learning from human feedback (RLHF). Reinforcement learning guides training through a reward mechanism, which plays a role analogous to the loss function in conventional training. Rewards can be computed more flexibly than a loss function (for example, AlphaGo's reward is the outcome of the game), but the price is that the reward is not differentiable and cannot be used directly for backpropagation. The idea of reinforcement learning is to fit a loss function from a large number of reward samples, thereby enabling model training. Human feedback is likewise non-differentiable, so it too can serve as a reward signal, yielding reinforcement learning from human feedback.

Taking ChatGPT as an example, tuning proceeds in three main steps. Based on RLHF, the tuning process of ChatGPT is divided into three steps: 1) training a supervised fine-tuned model; 2) training a reward model; 3) optimizing the policy parameters with PPO reinforcement learning. After tuning, the model's parameters are updated and its answers move closer to the results humans expect. The computing power demand of tuning is therefore similar in nature to pre-training, requiring a traversal of the model parameters, but the data set used is much smaller than in pre-training.

The computing power requirement of large-model tuning can be back-calculated from the GPU core-hours a tuning run consumes. For this we work backward from actual GPU core-hours. According to DeepSpeed-Chat (Microsoft's open-source system for model fine-tuning), tuning a 13-billion-parameter model takes 9 hours on 8 A800 accelerator cards. According to NVIDIA's official site, the peak compute of an A800 card is about 312 TFLOPS (TF32, with sparsity). On this basis, one tuning run of a 13-billion-parameter model consumes about 0.9 PFlop/s-day. By analogy, one tuning run of a 30-billion, 66-billion and 175-billion-parameter model requires about 1.9, 5.2 and 8.3 PFlop/s-day, respectively.
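The back-calculation for the 13B reference run can be sketched directly from the hardware data point, assuming the cards run at their quoted peak for the full duration:

```python
def finetune_run_pflops_day(n_gpus: int = 8, hours: float = 9.0,
                            tflops_per_gpu: float = 312.0) -> float:
    """Compute consumed by one fine-tuning run, back-calculated from GPU
    hours. Defaults reproduce the DeepSpeed-Chat data point in the text:
    8 x A800 (312 TFLOPS peak, TF32 with sparsity) for 9 hours on a 13B model."""
    total_flops = n_gpus * tflops_per_gpu * 1e12 * hours * 3600
    return total_flops / (1e15 * 86400)   # convert FLOPs to PFlop/s-day

print(f"{finetune_run_pflops_day():.2f} PFlop/s-day")
```

The defaults yield roughly 0.94 PFlop/s-day, which the text rounds to about 0.9.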

We estimate that monthly tuning of a trillion-parameter model requires more than 2,000 PFlop/s-day of compute. For comparability, we further assume that models of all sizes are tuned on a single A800 server instance (i.e., eight A800 accelerator cards) and that training duration scales in proportion to model parameters. In addition, for the number of tuning runs, we assume large-model vendors tune their models 30 times a month. On this basis, we calculate that monthly tuning of a 100-billion-parameter model requires about 216 PFlop/s-day, and monthly tuning of a 1-trillion-parameter model about 2,160 PFlop/s-day.
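A rough sketch of the monthly tuning estimate follows. Note the linear-scaling assumption is a simplification on our part: scaling the ~0.9 PFlop/s-day cost of a 13B run linearly with parameters gives about 208 PFlop/s-day per month for a 100B model, the same order as the 216 PFlop/s-day figure in the text.

```python
def monthly_tuning_pflops_day(params: float, runs_per_month: int = 30,
                              ref_params: float = 13e9,
                              ref_run_cost: float = 0.9) -> float:
    """Monthly fine-tuning compute: scale the 13B reference run (~0.9
    PFlop/s-day) linearly with parameter count, times 30 runs per month."""
    per_run = ref_run_cost * (params / ref_params)
    return per_run * runs_per_month

print(f"{monthly_tuning_pflops_day(100e9):.0f} PFlop/s-day per month")
```

The 1-trillion-parameter case comes out near 2,080 PFlop/s-day per month under the same assumption, again consistent in order of magnitude with the text's 2,160.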

Demand for computing infrastructure is expected to keep being released; watch opportunities in the computing industry.

Training, inference and tuning of large models create demand for computing hardware. The current mainstream approach is to carry large-model workloads on AI servers, whose core devices are AI GPUs such as NVIDIA's A100, H100 and B100. According to NVIDIA, a single A100 card delivers peak compute of 312 TFLOPS at TF32 (with sparsity) and 624 TFLOPS at FP16 (with sparsity). Since real workloads typically train and serve models across interconnected multi-card clusters, effective compute must be considered. According to GPT-NeoX-20B: An Open-Source Autoregressive Language Model, published by Sid Black et al. in 2022, the effective compute of a single card is about 117 TFLOPS (TF32, with sparsity), i.e., an effective compute ratio of 37.5%. We assume the inference process achieves the same effective ratio as training, giving single-card inference compute of 234 TFLOPS (FP16, with sparsity).

We estimate that training, inference and tuning of a 100-billion-parameter model require 28,000 A100-equivalent GPUs. We measure the computing infrastructure required by large models in numbers of GPUs and servers. In our framework, a large model's total computing power demand is the sum of the pre-training, inference and tuning demands. Since servers and other infrastructure are usually redeployed to develop the next-generation model after pre-training completes, we assume the pre-training, inference and tuning demands occur concurrently. We further assume that training, inference and tuning are each completed within one month. On this basis, we estimate the A100-equivalent GPU demand at 28,000 for a 100-billion-parameter model, 218,000 for a 500-billion-parameter model and 634,000 for a 1-trillion-parameter model. Assuming all servers carry eight A100 accelerator cards, the AI server demand for the 100-billion, 500-billion and 1-trillion-parameter models is about 3,500, 27,000 and 79,000 units, respectively.
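The GPU count can be reassembled from the per-stage estimates derived earlier. This is a sketch under the stated assumptions (pre-training amortized over one month, monthly tuning scaled linearly from the 13B reference run, peak inference sized from the ChatGPT-like traffic, 37.5% effective compute per card); it lands close to the 28,000 figure for the 100B model.

```python
TRAIN_EFF_PFLOPS = 312e-3 * 0.375   # effective training rate/card: 117 TFLOPS
INFER_EFF_PFLOPS = 624e-3 * 0.375   # effective inference rate/card: 234 TFLOPS

def a100_equivalents(params: float) -> float:
    """A100-equivalent GPUs needed when pre-training (spread over one month),
    monthly fine-tuning and peak inference all run concurrently."""
    pretrain_pd = 6 * params * 20 * params / (1e15 * 86400)  # PFlop/s-day
    tuning_pd = 0.9 * (params / 13e9) * 30                   # per month
    train_rate = (pretrain_pd + tuning_pd) / 30              # sustained PFlop/s
    peak_tok_s = 1.8e9 * 10 * 800 / (30 * 86400) * 5         # peak tokens/s
    infer_rate = 2 * params * peak_tok_s / 1e15              # peak PFlop/s
    return train_rate / TRAIN_EFF_PFLOPS + infer_rate / INFER_EFF_PFLOPS

print(f"{a100_equivalents(100e9):,.0f} A100-equivalent GPUs")
```

For 100B parameters this returns roughly 27,800 cards, i.e., the ~28,000 quoted in the text after rounding; inference accounts for the large majority of the total.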

Mature operation of large models is expected to create an AI server market of $316.9 billion. According to the China AI Large Model Map Research Report released by the China Institute of Scientific and Technical Information, 202 large models had been released worldwide as of May 2023, with China and the United States together accounting for nearly 90% of the global total. We expect the number of large models worldwide to keep rising, but as large models iterate, competition among model vendors should gradually reach equilibrium. We therefore conservatively assume that, in the future, 30 vendors will achieve mature operation of 100-billion-parameter models, 20 vendors of 500-billion-parameter models, and 10 vendors of 1-trillion-parameter models. According to JD.COM, a single Inspur NF5688M6 server carries eight A800 accelerator cards and is priced at RMB 1.59 million per unit, or about $220,000 at a USD/CNY rate of 7.23. Based on the server demand of the model tiers above, we estimate global large-model vendors' server demand at $316.9 billion.
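The market-size arithmetic can be checked directly from the assumptions above: operators per tier, the GPU counts estimated earlier, eight cards per server, and the NF5688M6 price. This sketch reproduces the ~$316.9 billion figure to within rounding.

```python
SERVER_PRICE_USD = 1.59e6 / 7.23    # Inspur NF5688M6: RMB 1.59M at USD/CNY 7.23

# (A100-equivalent GPUs per model, assumed number of mature operators)
SCENARIOS = [(28_000, 30),    # 100B-parameter models
             (218_000, 20),   # 500B-parameter models
             (634_000, 10)]   # 1T-parameter models

def implied_server_market_usd() -> float:
    """Implied AI server market: operators x (GPUs / 8 cards per server)
    x server price, summed over the three model tiers."""
    return sum(ops * (gpus / 8) * SERVER_PRICE_USD for gpus, ops in SCENARIOS)

print(f"${implied_server_market_usd() / 1e9:.1f}B")
```

The sum works out to roughly $317 billion, matching the report's $316.9 billion once intermediate rounding is accounted for.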

By comparison, the current global AI server market is only $21.1 billion, leaving much room for growth. According to Gartner, the global AI chip market reached $53.4 billion in 2023 and is expected to grow 25.7% year on year in 2024. According to IDC, the global AI server market reached $21.1 billion in 2023 and is expected to reach $31.8 billion in 2025, a CAGR of 22.7% over 2024-2025. Against the $316.9 billion of demand implied by the continued competition and mature operation of global large-model vendors, the current market of just $21.1 billion leaves substantial room for growth. We believe that as large models and AI applications proliferate worldwide, demand for training, inference and tuning is expected to drive rapid growth in computing infrastructure build-out.

Against the backdrop of localization, domestic AI GPUs are expected to accelerate their catch-up. On October 17, 2023, the Bureau of Industry and Security (BIS) of the U.S. Department of Commerce issued export restrictions to China on advanced computing and semiconductor manufacturing items, restricting domestic imports of high-performance AI chips. At the same time, a gap remains between domestic AI GPUs and the overseas state of the art. Among domestic AI GPUs, the Atlas 300T based on Huawei's Ascend 910 offers strong compute: its FP16 performance is about 90% of the NVIDIA A800 SXM's without sparsity. Compared with NVIDIA's most advanced products such as the B100, however, a gap of at least two generations remains. We believe that with AI chip imports restricted, the localization of AI GPUs is expected to accelerate, and with ongoing technical iteration the gap between domestic and overseas products should gradually narrow.

For a breakdown of industry-chain companies, please see the original research report.

Risk warning

Macroeconomic fluctuations. If the macro economy fluctuates, the pace of industrial transformation and the adoption of new technologies may be affected; macroeconomic fluctuations may also weigh on IT spending, leaving overall industry growth below expectations.

Downstream demand below expectations. If downstream demand for computing power falls short of expectations, related investment in computing power may slow or fall short of expectations, leaving industry growth below expectations.

Possible deviation in calculation results. Our calculations rely on assumptions such as the "scaling law" and the "Chinchilla scaling law", which are subjective to some extent. If they diverge from actual model training practice, the estimated computing power demand may be biased.

Related research report

Research report: "Global AI computing power demand continues to rise", April 12, 2024

Source: selected research reports from securities firms.
