Huawei Releases AI Inference Innovation UCM: High-Throughput, Low-Latency Inference Experience at Lower Per-Token Inference Cost
Xinhuanet, August 12, afternoon — At the 2025 Financial AI Inference Application Landing and Development Forum, Huawei and China UnionPay jointly released UCM (Inference Memory Data Manager), an AI inference innovation aimed at delivering a high-throughput, low-latency inference experience.
AI is developing rapidly in today's digital era. While the wave of large-model training has not subsided, the inference experience has quietly become a key factor in the success of AI applications. According to a white paper released by Zhongxin Jiantou at the 2025 WAIC, the industry's center of gravity is shifting from training to inference at an accelerating pace, which makes the quality of the AI inference experience increasingly important.
The inference experience directly shapes how users perceive an AI system, covering response time, answer accuracy, and the ability to reason over complex contexts. Industry data show that mainstream overseas models already reach single-user output speeds above 200 tokens per second (5 ms latency), while domestic models generally stay below 60 tokens per second (50-100 ms latency). Reconciling inference efficiency with user experience has therefore become a pressing problem.
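For reference, output speed and per-token latency are reciprocal quantities. Under the assumption that the quoted 5 ms refers to the interval between successive output tokens for a single user (the article does not define the metric precisely), it matches the 200 tokens-per-second figure:

```python
# Rough sanity check, assuming the quoted latency is the per-token output interval
# (an assumption; the article does not state which latency metric is meant).
per_token_latency_s = 0.005                  # 5 ms per output token
tokens_per_second = 1 / per_token_latency_s  # = 200 tokens/s
print(tokens_per_second)
```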
According to the introduction at the forum, the newly released UCM (Inference Memory Data Manager) is a caching acceleration suite centered on the KV Cache. It combines multiple caching acceleration algorithms and tools to manage the memory data generated during inference across different storage tiers, expanding the inference context window so as to deliver a high-throughput, low-latency inference experience while reducing the per-token inference cost.
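The announcement does not disclose UCM's internals. As a rough illustration of the general idea of tiered KV-cache management described above, the following minimal Python sketch (all names hypothetical, not Huawei's implementation) keeps the hottest KV-cache blocks in a fast memory tier and spills colder ones to a slower tier instead of discarding them, so a long context does not have to be recomputed:

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative sketch only: hot KV blocks stay in a fast tier (e.g. HBM),
    cold blocks are spilled to a slower tier (e.g. DRAM or SSD) and reloaded
    on demand, which lets the usable context exceed fast-memory capacity."""

    def __init__(self, fast_capacity_blocks: int):
        self.fast_capacity = fast_capacity_blocks
        self.fast_tier = OrderedDict()   # block_id -> KV block, in LRU order
        self.slow_tier = {}              # block_id -> KV block

    def put(self, block_id, kv_block):
        """Store a newly generated KV block; evict the least-recently-used
        block to the slow tier if the fast tier is full."""
        self.fast_tier[block_id] = kv_block
        self.fast_tier.move_to_end(block_id)
        while len(self.fast_tier) > self.fast_capacity:
            victim_id, victim = self.fast_tier.popitem(last=False)
            self.slow_tier[victim_id] = victim   # spill instead of discarding

    def get(self, block_id):
        """Fetch a KV block for attention; promote it back to the fast tier
        if it had been spilled."""
        if block_id in self.fast_tier:
            self.fast_tier.move_to_end(block_id)
            return self.fast_tier[block_id]
        kv_block = self.slow_tier.pop(block_id)  # KeyError if truly missing
        self.put(block_id, kv_block)
        return kv_block

if __name__ == "__main__":
    cache = TieredKVCache(fast_capacity_blocks=2)
    for i in range(4):                  # 4 KV blocks, only 2 fit in the fast tier
        cache.put(i, f"kv-block-{i}")
    print(cache.get(0))                 # block 0 was spilled and is reloaded on demand
```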
Responsible Editor: Guo Xue Tong