2024, Vol. 30, No. 2, pp. 43-49
Distributed Inference Technology for Large Language Models in Low-Resource Clusters
Abstract:

A distributed inference paradigm for large language models (LLMs) with stronger parallelism and better compatibility is explored, designed for environments with weak computing power and small GPU memory. To handle the different bandwidths inside and outside a host, an efficient All-Reduce collective communication technique based on communication trees is designed; for small-memory clusters, a fine-grained GPU memory management and scheduling technique is designed. Based on these key techniques, an LLM inference software system for resource-constrained scenarios is built, aiming to maximize the size of the LLM that can be served on a limited number of low-resource devices, while accelerating distributed inference through optimized communication strategies and computation scheduling. Experiments show that with these techniques, the time to first token is reduced by 34% to 61%, token generation throughput (tokens per second) is increased by 52% to 150%, and GPU memory usage is reduced by 61%.
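The page gives no implementation details, but the general idea behind a bandwidth-aware, tree-based All-Reduce can be illustrated with a short, self-contained sketch. The NumPy code below simulates a two-level reduction: tensors are first summed inside each host over the fast local links, the per-host partial sums are then combined across hosts along a binary communication tree over the slower inter-host links, and the global result is finally broadcast back to every device. The names (`tree_all_reduce`, `hosts`) are illustrative assumptions, not taken from the paper, and the simulation only models the data flow, not real inter-process communication.

```python
# Illustrative simulation of a two-level, tree-based All-Reduce.
# Devices are grouped by host: summation first happens inside each host
# over the fast local links, the per-host partial sums are then combined
# across hosts along a binary tree over the slower inter-host links, and
# the global sum is finally broadcast back to every device.
# NumPy simulation of the data flow only; names are illustrative.
import numpy as np

def tree_all_reduce(hosts):
    """hosts: list of hosts, each a list of per-device np.ndarray tensors."""
    # Step 1: intra-host reduction onto a local "leader" per host.
    host_sums = [np.sum(np.stack(devices), axis=0) for devices in hosts]

    # Step 2: pairwise tree reduction across host leaders.  Round r combines
    # leaders 2**r apart, so the slow inter-host links only carry
    # O(log2(#hosts)) sequential reduction steps.
    n, step = len(host_sums), 1
    while step < n:
        for i in range(0, n, 2 * step):
            if i + step < n:
                host_sums[i] = host_sums[i] + host_sums[i + step]
        step *= 2
    global_sum = host_sums[0]

    # Step 3: broadcast the global sum back to every device on every host.
    return [[global_sum.copy() for _ in devices] for devices in hosts]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hosts = [[rng.standard_normal(3) for _ in range(4)] for _ in range(2)]  # 2 hosts x 4 devices
    expected = np.sum(np.stack([t for h in hosts for t in h]), axis=0)
    result = tree_all_reduce(hosts)
    assert all(np.allclose(t, expected) for h in result for t in h)
    print("every device now holds the global sum:", result[0][0])
```

Under this layout only the per-host leaders exchange data over the slow links, and they do so in a logarithmic number of rounds; a production implementation would additionally pipeline and overlap these transfers, which the simulation does not model.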

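Similarly, the fine-grained GPU memory management and scheduling technique is only described at a high level here. The PyTorch sketch below shows one common building block such systems can rely on: keeping layer weights in pinned host memory and streaming them to the GPU one layer at a time, prefetching the next layer on a side CUDA stream while the current layer computes, so that only one or two layers are resident in GPU memory at once. The class name `StreamedMLP` and all other identifiers are hypothetical; this is a generic sketch under those assumptions, not the paper's implementation.

```python
# Illustrative layer-by-layer weight streaming with prefetch (PyTorch).
# Weights stay in pinned host memory; before layer i runs, layer i+1's
# weights are copied to the GPU on a side stream so that the copy
# overlaps with layer i's computation.
# Generic sketch of the idea; identifiers are hypothetical, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamedMLP(nn.Module):
    def __init__(self, n_layers=8, dim=1024, device="cuda"):
        super().__init__()
        self.device = torch.device(device)
        # All weights live on the CPU, pinned for fast asynchronous copies.
        self.cpu_layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        for p in self.cpu_layers.parameters():
            p.data = p.data.pin_memory()
        self.copy_stream = torch.cuda.Stream(device=self.device)

    def _fetch(self, i):
        # Asynchronously copy layer i's weights to the GPU on the side stream.
        with torch.cuda.stream(self.copy_stream):
            layer = self.cpu_layers[i]
            return (layer.weight.to(self.device, non_blocking=True),
                    layer.bias.to(self.device, non_blocking=True))

    @torch.no_grad()
    def forward(self, x):
        x = x.to(self.device)
        nxt = self._fetch(0)
        compute_stream = torch.cuda.current_stream(self.device)
        for i in range(len(self.cpu_layers)):
            w, b = nxt
            # Wait until this layer's copy has landed on the GPU.
            compute_stream.wait_stream(self.copy_stream)
            # Tell the caching allocator these GPU copies are used on the
            # compute stream, so their memory is not recycled too early.
            w.record_stream(compute_stream)
            b.record_stream(compute_stream)
            if i + 1 < len(self.cpu_layers):
                nxt = self._fetch(i + 1)   # prefetch the next layer's weights
            x = F.relu(F.linear(x, w, b))
            del w, b                       # this layer's GPU copy can be freed
        return x

if __name__ == "__main__":
    if torch.cuda.is_available():
        model = StreamedMLP()
        print(model(torch.randn(4, 1024)).shape)   # -> torch.Size([4, 1024])
    else:
        print("CUDA not available; this sketch needs a GPU to run.")
```

In a real system, the prefetch granularity, the number of resident layers, and what stays permanently on the GPU (for example embeddings or the KV cache) are scheduling decisions; the sketch fixes all of them to the simplest possible choice.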

Basic information:


CLC number: TP391.1; TP18

Citation:

[1]冯文佼,李宗航,虞红芳.低资源集群中的大语言模型分布式推理技术[J].中兴通讯技术,2024,30(02):43-49.
