萍聚社区-德国热线-德国实用信息网


(上)GPU后,AI算力找到新方向!(Part 1) After the GPU, AI computing power finds a new direction!

发表于 2024-3-26 13:05
作者:U才优料




种种迹象表明,得益于自身对神经网络计算进行的专门优化,在端侧和边缘侧处理复杂神经网络算法时拥有的更高效率和更低能耗,神经网络处理器(NPU)正成为推动AI手机、AI PC和端侧AI市场前行的强大动能,并有望开启属于自己的大规模商用时代。

All signs suggest that, thanks to its dedicated optimization for neural-network computation and the higher efficiency and lower power consumption it delivers when running complex neural-network algorithms on devices and at the edge, the neural processing unit (NPU) is becoming a powerful driving force behind AI phones, AI PCs, and the on-device AI market, and is poised to usher in its own era of large-scale commercial adoption.

什么是NPU

What is an NPU

NPU是一种专为实现以低功耗加速AI推理而打造的处理器,其架构随着新AI算法、模型和用例的发展不断演进。一个优秀的、专用的定制化NPU设计必须要在性能、功耗、效率、可编程性和面积之间进行权衡取舍,才能够为处理AI工作负载做出正确的选择,与AI行业方向保持高度一致。

An NPU is a processor built specifically to accelerate AI inference at low power, and its architecture keeps evolving alongside new AI algorithms, models, and use cases. A good, purpose-built, customized NPU design must balance performance, power, efficiency, programmability, and silicon area in order to make the right choices for AI workloads and stay closely aligned with where the AI industry is heading.

早在2015年,面向音频和语音AI用例而设计的NPU就诞生了,这些用例基于简单卷积神经网络(CNN)并且主要需要标量和向量数学运算。从2016年开始,拍照和视频AI用例大受欢迎,出现了基于Transformer、循环神经网络(RNN)、长短期记忆网络(LSTM)和更高维度的卷积神经网络(CNN)等更复杂的全新模型。这些工作负载需要大量张量数学运算,因此NPU增加了张量加速器和卷积加速,让处理效率大幅提升。

As early as 2015, NPUs designed for audio and speech AI use cases appeared; these use cases were based on simple convolutional neural networks (CNNs) and mainly required scalar and vector math. From 2016 onward, photo and video AI use cases surged in popularity, bringing new and more complex models based on Transformers, recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and higher-dimensional CNNs. These workloads demand large amounts of tensor math, so NPUs added tensor accelerators and convolution acceleration, greatly improving processing efficiency.
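The progression described above, from scalar to vector to tensor math, can be sketched in plain NumPy. This is an illustrative sketch only, not any vendor's NPU API; the array shapes and the naive convolution loop are my own assumptions:

```python
import numpy as np

# Scalar math: one value at a time (typical of early audio/voice pipelines).
gain = 0.5
sample = 0.8
scaled = gain * sample

# Vector math: one operation applied across a whole 1-D signal at once.
signal = np.random.rand(16000)            # 1 s of 16 kHz audio
normalized = (signal - signal.mean()) / signal.std()

# Tensor math: a 2-D convolution over an image-like tensor, the kind of
# workload that motivated dedicated tensor/convolution accelerators.
image = np.random.rand(1, 3, 32, 32)      # NCHW: batch, channels, height, width
kernel = np.random.rand(8, 3, 3, 3)       # 8 output channels, 3x3 window

def conv2d(x, w):
    """Naive valid-mode 2-D convolution (the loops an NPU would replace)."""
    n, c, h, wd = x.shape
    oc, _, kh, kw = w.shape
    out = np.zeros((n, oc, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]     # (n, c, kh, kw)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out

feature_map = conv2d(image, kernel)        # shape (1, 8, 30, 30)
```

Each step up in dimensionality multiplies the arithmetic per input element, which is why tensor workloads drove dedicated hardware acceleration.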

到了2023年,大语言模型(LLM)——比如Llama 2-7B,和大视觉模型(LVM)——比如Stable Diffusion赋能的生成式AI,使得典型模型的大小提升超过了一个数量级。除计算需求之外,还需要重点考虑内存和系统设计,通过减少内存数据传输以提高性能和能效。未来预计将会出现对更大规模模型和多模态模型的需求。

By 2023, generative AI powered by large language models (LLMs) such as Llama 2-7B and large vision models (LVMs) such as Stable Diffusion had pushed typical model sizes up by more than an order of magnitude. Beyond raw compute, memory and system design now require careful attention, since reducing memory data transfer improves both performance and energy efficiency. Demand for even larger models and for multimodal models is expected in the future.
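A rough back-of-the-envelope calculation shows why model size pushes memory and system design into the foreground. The numbers below are my own assumptions (weight storage only, ignoring activations and the KV cache), not figures from the article:

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Llama 2-7B at different precisions:
fp16 = weight_footprint_gb(7, 16)   # 14.0 GB at FP16
int4 = weight_footprint_gb(7, 4)    # 3.5 GB at INT4 quantization
```

Even at INT4, each generated token must stream several gigabytes of weights through the memory hierarchy, which is why cutting memory data transfer matters as much as adding raw TOPS.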

AI PC将NPU推上新高地

The AI PC takes the NPU to new heights

2024年被普遍视为AI PC元年,根据Canalys预测,到2027年,AI PC出货量将超过1.7亿台,其中近60%将部署在商用领域。为了顺应PC行业的发展潮流,并显著提高端侧AI能力,英特尔、AMD、高通等头部芯片厂商也正努力将专用NPU集成到CPU中,相关产品及路线图已经得到公布。

2024 is widely regarded as the first year of the AI PC. Canalys forecasts that AI PC shipments will exceed 170 million units by 2027, nearly 60% of which will go to the commercial sector. To ride the PC industry's momentum and significantly improve on-device AI capability, leading chipmakers such as Intel, AMD, and Qualcomm are working to integrate dedicated NPUs into their CPUs, and the related products and roadmaps have already been announced.

尽管AI PC实际市场表现取决于生态系统的协作水平,但毫无疑问的是,集成了NPU的中央处理器将驱动新一轮AI PC的发展。与此同时,如何在电脑处理器中发挥出NPU的最大功效,也成为了业内热议的话题。

Although the actual market performance of AI PCs depends on how well the ecosystem collaborates, there is no doubt that CPUs with integrated NPUs will drive the next wave of AI PC development. Meanwhile, how to extract the most from the NPU inside a PC processor has become a hot topic in the industry.

2023年12月,AMD率先发布锐龙8040系列处理器,其最核心的变化之一就是新增了AI计算单元。根据AMD的说法,得益于NPU的加入,锐龙8040系列处理器的AI算力从10TOPS提升到了16TOPS,性能提升幅度达到了60%。这让锐龙8040系列处理器在LLM等模型性能更加突出,例如Llama 2大语言模型性能提升40%,视觉模型提升40%。

In December 2023, AMD was the first to release the Ryzen 8040 series processors, one of whose core changes is the addition of an AI compute unit. According to AMD, thanks to the NPU, the AI compute of the Ryzen 8040 series rose from 10 TOPS to 16 TOPS, a 60% improvement. This makes the Ryzen 8040 series stand out on models such as LLMs: Llama 2 large-language-model performance improved by 40%, and vision-model performance also improved by 40%.
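As a quick sanity check on the quoted figures, and an illustrative view of how a peak-TOPS number is commonly derived, here is a short sketch. The MAC-unit count and clock rate below are hypothetical values chosen for illustration, not AMD specifications:

```python
# Verify the quoted uplift: 10 TOPS -> 16 TOPS is a 60% improvement.
old_tops, new_tops = 10, 16
uplift = (new_tops - old_tops) / old_tops   # 0.6, i.e. 60%

def peak_tops(mac_units: int, clock_ghz: float) -> float:
    """Peak TOPS = MAC units x 2 ops per MAC (multiply + add) x clock rate."""
    return mac_units * 2 * clock_ghz * 1e9 / 1e12

# A hypothetical configuration that would land at 16 TOPS:
print(peak_tops(mac_units=8000, clock_ghz=1.0))  # prints 16.0
```

Note that peak TOPS is a theoretical ceiling; sustained throughput depends on memory bandwidth and utilization, which is why vendors also quote end-to-end model benchmarks.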

一周之后,英特尔新一代酷睿Ultra移动处理器正式发布,这是其40年来第一个内建NPU的处理器,用于在PC上带来高能效的AI加速和本地推理体验,被业界视作英特尔客户端处理器路线图的转折点。英特尔方面将NPU与CPU、GPU共同作为AI PC的三个底层算力引擎,预计在2024年,将有230多款机型搭载酷睿Ultra。

A week later, Intel officially released its new-generation Core Ultra mobile processors, its first in 40 years with a built-in NPU, bringing energy-efficient AI acceleration and local inference to the PC; the industry regards this as a turning point in Intel's client processor roadmap. Intel positions the NPU, CPU, and GPU together as the three underlying compute engines of the AI PC, and expects more than 230 models to ship with Core Ultra in 2024.



来自Trendforce的消息称,微软计划在Windows12中为AI PC设置最低门槛,需要至少40TOPS算力和16GB内存。也就是说,PC芯片算力跨越40TOPS门槛将成为首要目标,这也将进一步推进NPU的升级方向,比如:提升算力、提高内存、降低功耗,芯片持续进行架构优化、异构计算优化和内存升级。

According to TrendForce, Microsoft plans to set a minimum bar for AI PCs in Windows 12: at least 40 TOPS of compute and 16 GB of memory. In other words, crossing the 40 TOPS threshold will become the primary goal for PC chips, which will further shape the NPU's upgrade path: more compute, more memory, lower power, and continued architectural optimization, heterogeneous-computing optimization, and memory upgrades at the chip level.

再来看一下高通的思路。高通是不打算从一开始就只依赖NPU实现移动设备AI体验的,而是将Hexagon NPU、Adreno GPU、Kryo或Oryon CPU、传感器中枢和内存子系统“打包”,组成“高通AI引擎”。这意味着高通NPU的差异化优势在于系统级解决方案、定制设计和快速创新。通过定制设计NPU并控制指令集架构(ISA),高通能够快速进行设计演进和扩展,以解决瓶颈问题并优化性能。目前,高通NPU从2015年初次被集成到SoC至今,在9年左右的时间里其实已经更迭了四代不同的基础架构。

Now consider Qualcomm's approach. Qualcomm never intended to rely on the NPU alone for the mobile AI experience; instead it packages the Hexagon NPU, Adreno GPU, Kryo or Oryon CPU, sensor hub, and memory subsystem into the "Qualcomm AI Engine." This means Qualcomm's NPU differentiates on system-level solutions, custom design, and rapid innovation. By custom-designing the NPU and controlling its instruction set architecture (ISA), Qualcomm can rapidly evolve and scale the design to resolve bottlenecks and optimize performance. Since first integrating its NPU into an SoC in 2015, Qualcomm has in fact iterated through four generations of distinct base architectures in roughly nine years.



本土NPU企业持续发力

Domestic NPU companies keep up the momentum

在国内厂商当中,2017年,华为最先将NPU处理器集成到手机CPU中,使得CPU单位时间计算的数据量和单位功耗下的AI算力得到显著提升,让业内看到了NPU应用于终端设备的潜力。OPPO曾经的自研NPU马里亚纳X,在拍照、拍视频等大数据流场景下实现了更好的运算效率,拉开了高端智能手机的体验差距。

Among domestic vendors, Huawei was the first, in 2017, to integrate an NPU into a phone CPU, significantly improving the data throughput per unit time and the AI compute per unit of power, and showing the industry the potential of NPUs in end devices. OPPO's formerly self-developed NPU, MariSilicon X (马里亚纳X), achieved better compute efficiency in data-heavy scenarios such as photo and video capture, widening the experience gap among high-end smartphones.

2018年11月,作为安谋科技成立后第一款正式对外发布的本土研发IP产品,“周易”Z1 NPU在乌镇举办的第五届世界互联网大会上公开亮相;两年后的2020年10月,能够在单颗SoC中实现128TOPS强大算力的“周易”Z2 NPU面世;2023年推出的“周易”X2 NPU则主要面向智能汽车产业和边缘计算,支持多核Cluster,以及大模型基础架构Transformer,可提供最高320TOPS的算力。商业化落地方面,目前“周易”NPU已和全志科技、芯擎科技、芯驰科技等多家本土芯片厂商实现了合作。

In November 2018, the "Zhouyi" Z1 NPU, the first domestically developed IP product officially released after the founding of Arm China (安谋科技), was unveiled at the Fifth World Internet Conference in Wuzhen. Two years later, in October 2020, the "Zhouyi" Z2 NPU arrived, delivering up to 128 TOPS in a single SoC. The "Zhouyi" X2 NPU, launched in 2023, mainly targets the intelligent-vehicle industry and edge computing; it supports multi-core clusters and the Transformer architecture underlying large models, and provides up to 320 TOPS of compute. On the commercialization side, the "Zhouyi" NPU has so far partnered with a number of local chipmakers, including Allwinner Technology (全志科技), SiEngine (芯擎科技), and SemiDrive (芯驰科技).



“周易”X2 NPU主要功能升级(来源:安谋科技)

"Zhouyi" X2 NPU main feature upgrades (Source: Arm China)

关于我们

ABOUT US

U才优料成立于2017年,U才科技旗下品牌,倡导正能量,人才创新共享,电子行业优质人才,材料,成品优秀产业链。

Founded in 2017, U才优料 is a brand of U CAI Technology that advocates positive energy and shared talent innovation, serving the electronics industry with high-quality talent, materials, and a strong finished-goods supply chain.
