TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators
TENCENT Hunyuan announced that its HPC-Ops inference operator library has received a system-level upgrade, evolving from a set of standalone operators into an optimization suite of five key operators that covers the entire inference pipeline.

The upgrade targets real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication. On several metrics, the reported performance significantly exceeds existing open-source baselines.

HPC-Ops is an industrial-grade, high-performance operator library for large-model inference, open-sourced and maintained over the long term by the TENCENT Hunyuan AI Infra team. Key highlights of this upgrade include:

Attention: To tackle the computation imbalance and long-tail latency caused by mixing short and long requests in real workloads, the library adopts runtime dynamic load scheduling. Tests show up to 2.95x speedup in long-text scenarios and up to a 17% improvement in end-to-end QPM.
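
The article does not describe the scheduler itself, so the following is only a toy sketch of why dynamic scheduling helps with mixed request lengths. It assumes (hypothetically) that each request's KV cache can be split into fixed-size chunks that idle workers pick up, and it ignores the cost of merging partial attention results. The function names and the chunk size are illustrative, not taken from HPC-Ops.

```python
import heapq

def static_makespan(kv_lengths, num_workers):
    """One whole request per worker, assigned round-robin."""
    loads = [0] * num_workers
    for i, n in enumerate(kv_lengths):
        loads[i % num_workers] += n
    return max(loads)

def dynamic_makespan(kv_lengths, num_workers, chunk=1024):
    """Split requests into chunks; each chunk goes to the least-loaded worker."""
    chunks = []
    for n in kv_lengths:
        full, rest = divmod(n, chunk)
        chunks.extend([chunk] * full)
        if rest:
            chunks.append(rest)
    loads = [0] * num_workers
    heapq.heapify(loads)
    for c in chunks:
        least = heapq.heappop(loads)      # worker that frees up first
        heapq.heappush(loads, least + c)
    return max(loads)

# One long request mixed with several short ones (KV lengths in tokens).
lengths = [128, 256, 32768, 512, 64, 1024, 128, 256]
print(static_makespan(lengths, 4), dynamic_makespan(lengths, 4))
```

In the static case the worker holding the long request determines the finish time, which is the long-tail effect the article mentions; the chunked variant spreads that work across all workers.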

Router GEMM: Combines two BF16 GEMMs to achieve FP32-level precision, balancing inference accuracy and GPU utilization. Precision is significantly better than conventional BF16/TF32 approaches, with up to 3.22x speedup over cuBLAS FP32.
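
The article gives no details of the dual-GEMM construction. One common way to recover precision from BF16 arithmetic, shown here purely as an assumption, is to split each high-precision weight into a BF16 "high" part plus a BF16 residual and sum two BF16 products. The sketch below emulates this in plain Python; `to_bf16`, `split_bf16`, and `router_logit_dual` are hypothetical names, and Python floats stand in for the higher-precision accumulator.

```python
import struct

def to_bf16(x):
    """Round x to the nearest bfloat16 value (ties to even), as a Python float.
    Assumes finite inputs of moderate magnitude."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even on low 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split_bf16(w):
    """Represent w as hi + lo, where both parts are bfloat16 values."""
    hi = to_bf16(w)
    lo = to_bf16(w - hi)
    return hi, lo

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def router_logit_single(x, w):
    """Baseline: one product with weights rounded to BF16."""
    return dot(x, [to_bf16(v) for v in w])

def router_logit_dual(x, w):
    """Two BF16 products (high part and residual), summed."""
    parts = [split_bf16(v) for v in w]
    return dot(x, [hi for hi, _ in parts]) + dot(x, [lo for _, lo in parts])

x = [to_bf16(0.37 * (i + 1)) for i in range(16)]   # activations already in BF16
w = [0.0123 * (i + 1) + 0.001 for i in range(16)]  # higher-precision router weights
ref = dot(x, w)
print(abs(router_logit_single(x, w) - ref), abs(router_logit_dual(x, w) - ref))
```

Under this assumption the high and low parts together carry roughly 16 significant bits of each weight instead of 8, which is why the summed result tracks the reference more closely than a single BF16 product.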

FusedMoE: Builds a full-module MoE pipeline that integrates the multi-stage process while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: Deeply fuses cross-GPU communication, residual addition, and normalization into a single operation. Compared with mainstream solutions including NCCL and FlashInfer, it achieves 1.04-1.68x speedup.
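
As a reference for what such a fused operator computes, the sketch below spells out the three unfused steps in plain Python. It is not the HPC-Ops implementation: the article says only "normalization", so the use of RMSNorm, the `eps` value, and all function names are assumptions, and lists stand in for per-GPU tensors.

```python
import math

def allreduce_sum(per_rank):
    """Element-wise sum across ranks; every rank would receive this result."""
    return [sum(vals) for vals in zip(*per_rank)]

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm (assumed here; the article does not name the norm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def allreduce_add_norm(per_rank, residual, weight, eps=1e-6):
    """Unfused reference: all-reduce, then residual add, then normalize."""
    reduced = allreduce_sum(per_rank)
    new_residual = [r + h for r, h in zip(residual, reduced)]
    return rms_norm(new_residual, weight, eps), new_residual

# Two ranks, each holding a partial result for the same two-element hidden state.
normed, new_residual = allreduce_add_norm(
    per_rank=[[1.0, 2.0], [3.0, 4.0]],
    residual=[-1.0, -2.0],
    weight=[1.0, 1.0],
)
print(new_residual, normed)
```

Run as separate operators, each step reads and writes the hidden state through GPU memory; fusing them removes those intermediate round trips, which is the overhead the article refers to.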

Sampler: Consolidates decoding-stage sampling, which originally required more than ten operator steps, into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. It is 4.0-7.5x faster than vLLM and 1.9-4.7x faster than FlashInfer, addressing a bottleneck at the final stage of inference.
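
To illustrate why sampling spans many operator steps, the sketch below walks through a typical sequence in plain Python. The article does not list the actual steps, so the choice of temperature scaling, top-k, softmax, top-p, and random draw is an assumption, and `sample_token` is a hypothetical name. It assumes `temperature > 0`.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    # Step 1: temperature scaling.
    scaled = [l / temperature for l in logits]
    # Step 2: sort token ids by score, then apply top-k (0 disables it).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Step 3: numerically stable softmax over the surviving tokens.
    m = scaled[order[0]]
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 4: top-p (nucleus) cut on the cumulative probability.
    kept, cum = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Step 5: draw from the kept tokens, scaled by their remaining mass.
    r = rng.random() * sum(p for _, p in kept)
    acc = 0.0
    for i, p in kept:
        acc += p
        if r < acc:
            return i
    return kept[-1][0]

print(sample_token([2.0, 1.0, 0.0], temperature=0.8, top_k=2, top_p=0.9))
```

In a framework, each numbered step usually maps to one or more separate GPU operators with its own launch and memory traffic, which is the overhead a two-kernel design would avoid.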

This article was automatically translated by AI; the original-language version should be considered authoritative. AASTOCKS.com Limited does not guarantee its accuracy or completeness and accepts no liability for any damages or losses arising from the use of this translation.

AASTOCKS Financial News