TENCENT Hunyuan AI Infra Open-Sources Upgraded HPC-Ops Inference Core Operators
TENCENT Hunyuan announced a system-level upgrade to its HPC-Ops inference operator library, which has evolved from a set of standalone operators into an optimization suite covering the entire inference pipeline, built around five key operators. The upgrade targets real-world engineering bottlenecks on mainstream inference platforms, such as long-tail latency in Attention, GPU memory transfer overhead, and cross-card communication, and multiple performance metrics significantly outperform existing open-source baselines. HPC-Ops is an industrial-grade, high-performance large-model inference operator library that is open-sourced and maintained over the long term by the TENCENT Hunyuan AI Infra team.

Key highlights of this upgrade include:

Attention: Mixed short and long requests under real workloads cause computation imbalance and long-tail latency, which a runtime dynamic load scheduling scheme is adopted to address. Tests show up to 2.95x acceleration in long-text scenarios and up to a 17% improvement in end-to-end QPM.

Router GEMM: Two BF16 GEMMs are combined to achieve FP32-level precision, balancing inference accuracy and GPU utilization. Precision is significantly better than conventional BF16/TF32 approaches, with up to 3.22x speedup compared with cuBLAS FP32.

FusedMoE: A full-module MoE pipeline integrates the multi-stage process while eliminating GPU memory transfer and kernel launch overhead. Compared with mainstream frameworks such as vLLM and SGLang, performance improves by 1.2-1.6x.

Fused AllReduce+Norm: Cross-GPU communication, residual addition, and normalization are fused into a single operator. Compared with mainstream solutions including NCCL and FlashInfer, it achieves 1.04-1.68x acceleration.

Sampler: Decoding-stage sampling, which originally required more than ten operator steps, is consolidated into two CUDA kernels, significantly reducing scheduling, read-write, and synchronization overhead. Speed increases by 4.0-7.5x compared with vLLM and by 1.9-4.7x compared with FlashInfer, addressing bottlenecks at the end of the inference pipeline.

This article was automatically translated by AI; the original-language version should be considered the authoritative version. AASTOCKS.com Limited does not guarantee its accuracy or completeness and accepts no liability for any damages or losses arising from the use of this translation.
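For readers curious how a pair of BF16 GEMMs can approach FP32 accuracy, as the Router GEMM item describes, the following is a minimal NumPy sketch of one well-known approach: splitting an FP32 operand into a high and a low BF16 part and summing two products. The article does not describe the actual HPC-Ops kernel, so the split strategy, the assumption that activations are already BF16 while router weights are FP32, and all function names here (`to_bf16`, `split_bf16`, `dual_bf16_matmul`) are illustrative assumptions rather than the library's API. BF16 values are emulated in float32 storage, and float32 matmul stands in for tensor-core FP32 accumulation.

```python
import numpy as np


def to_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16 precision (result kept in float32 storage)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Round to nearest even on the 16 low mantissa bits that bfloat16 discards.
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)


def split_bf16(w: np.ndarray):
    """Split FP32 `w` into hi + lo, each representable in bfloat16 (hypothetical helper)."""
    hi = to_bf16(w)
    lo = to_bf16(w.astype(np.float32) - hi)  # residual left over after the first rounding
    return hi, lo


def dual_bf16_matmul(x_bf16: np.ndarray, w_fp32: np.ndarray) -> np.ndarray:
    """Approximate x @ w using two GEMMs whose inputs are all bfloat16-representable."""
    w_hi, w_lo = split_bf16(w_fp32)
    # Each product of two bf16 values is exact in float32, so the float32
    # matmul here emulates bf16 inputs with FP32 accumulation.
    return x_bf16 @ w_hi + x_bf16 @ w_lo


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = to_bf16(rng.standard_normal((4, 256)).astype(np.float32))  # assumed bf16 activations
    w = rng.standard_normal((256, 8)).astype(np.float32)           # assumed FP32 router weights
    ref = x.astype(np.float64) @ w.astype(np.float64)
    print("single BF16 GEMM max error:", np.abs(x @ to_bf16(w) - ref).max())
    print("dual BF16 GEMM max error:  ", np.abs(dual_bf16_matmul(x, w) - ref).max())
```

Running the script prints the maximum absolute error of a single BF16 GEMM and of the two-GEMM combination against a float64 reference, which gives a rough feel for why the second GEMM on the rounding residual recovers most of the lost precision. A real kernel would issue both products on BF16 tensor cores rather than emulate them in float32.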
AASTOCKS Financial News