2025, No. 2, Vol. 31, pp. 39-46
Design of a wide-area intelligent computing network simulation architecture for multi-computing-center collaboration

Abstract:

Existing intelligent computing simulators struggle to capture the spatio-temporal dynamics of wide-area networks. To address this, a wide-area intelligent computing network simulation architecture for the collaboration of multiple computing power centers is proposed. The key innovations of this architecture include: 1) an attributed-graph-based topology abstraction method for modeling irregular connections among heterogeneous computing resources and reproducing unstable networks; 2) a flow-aware wide-area communication simulation architecture enabling high-precision network communication emulation; 3) an event-triggered dynamic scheduling protocol for multiple computing centers that ensures the causal consistency of cross-domain operations through logical clocks. This architecture fills the gap in simulation tools for wide-area multi-computing-center scenarios and provides efficient, reliable simulation support for researchers in the field of wide-area intelligent computing.
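The first innovation above models the topology as an attributed graph, where nodes (computing centers) and edges (WAN links) each carry free-form attributes so that irregular connections and unstable links can be represented. A minimal sketch of the idea, with all class, method, and attribute names being my own illustrative assumptions rather than the paper's API:

```python
class AttributedGraph:
    """Nodes and directed edges both carry free-form attribute dicts,
    so heterogeneous centers and irregular WAN links share one model."""

    def __init__(self):
        self.nodes = {}   # name -> attribute dict (e.g. accelerator type)
        self.edges = {}   # (src, dst) -> attribute dict (e.g. bandwidth, loss)

    def add_node(self, name, **attrs):
        self.nodes[name] = attrs

    def add_edge(self, src, dst, **attrs):
        self.edges[(src, dst)] = attrs

    def degrade_link(self, src, dst, loss_rate):
        """Model an unstable WAN link by raising its packet-loss attribute."""
        self.edges[(src, dst)]["loss"] = loss_rate


# Two heterogeneous centers joined by an irregular wide-area link
g = AttributedGraph()
g.add_node("center_a", accelerator="GPU-A", tflops=312)
g.add_node("center_b", accelerator="GPU-B", tflops=120)
g.add_edge("center_a", "center_b", bandwidth_gbps=10, rtt_ms=35, loss=0.0)
g.degrade_link("center_a", "center_b", 0.02)  # inject instability
```

Because attributes are untyped dicts, a simulator built on this model can attach whatever per-link state (loss, jitter, congestion) a scenario needs without changing the graph structure itself.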
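The third innovation relies on logical clocks to keep cross-domain operations causally consistent. The abstract does not specify the clock variant; as a minimal sketch under the assumption of a Lamport-style scalar clock (names are my own), the ordering rule works like this:

```python
class LamportClock:
    """Scalar logical clock: guarantees that a message's receive event
    is always ordered after its send event, even across domains."""

    def __init__(self):
        self.time = 0

    def send(self):
        # Tick and attach the timestamp to the outgoing message
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge rule: jump past both local and remote history, then tick
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two computing centers exchanging a cross-domain scheduling message
a, b = LamportClock(), LamportClock()
t_send = a.send()           # center a stamps the message
t_recv = b.receive(t_send)  # center b's clock jumps past the sender's
```

The invariant `t_recv > t_send` is what lets an event-triggered scheduler replay cross-center operations in a causally valid order regardless of wall-clock skew between domains.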

References

[1]JIANG Z,LIN H B,ZHONG Y M,et al.MegaScale:scaling large language model training to more than 10,000 GPUs[EB/OL].[2025-03-18].https://arxiv.org/abs/2402.15627

[2]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[EB/OL].[2025-03-19].https://dl.acm.org/doi/10.5555/3295222.3295349

[3]RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[EB/OL].[2025-03-19].https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[4]BROWN T B,MANN B,RYDER N,et al.Language models are few-shot learners[EB/OL].(2020-07-22)[2025-03-15].https://arxiv.org/abs/2005.14165

[5]DeepSeek-AI,LIU A X,FENG B,et al.DeepSeek-V3 technical report[EB/OL].(2025-02-18)[2025-03-18].https://arxiv.org/abs/2412.19437

[6]DeepSeek-AI,LIU A X,FENG B,et al.DeepSeek-V2:a strong,economical,and efficient mixture-of-experts language model[EB/OL].(2024-06-19)[2025-03-18].https://arxiv.org/abs/2405.04434

[7]DeepSeek-AI,GUO D,YANG D,et al.DeepSeek-R1:incentivizing reasoning capability in LLMs via reinforcement learning[EB/OL].(2025-01-22)[2025-03-16].https://arxiv.org/abs/2501.12948

[8]RASHIDI S,SRIDHARAN S,SRINIVASAN S,et al.ASTRA-SIM:enabling SW/HW co-design exploration for distributed DL training platforms[C]//Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS).IEEE,2020:81-92.DOI:10.1109/ispass48437.2020.00018

[9]WON W,HEO T,RASHIDI S,et al.ASTRA-sim2.0:modeling hierarchical networks and disaggregated systems for large-model training at scale[C]//Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS).IEEE,2023:283-294.DOI:10.1109/ISPASS57527.2023.00035

[10]WANG X Z,LI Q X,XU Y C,et al.SimAI:unifying architecture design and performance tuning for large-scale large language model training with scalability and precision[EB/OL].[2025-03-18].https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf

[11]BANG J,CHOI Y,KIM M,et al.vTrain:a simulation framework for evaluating cost-effective and compute-optimal large language model training[C]//Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture(MICRO).IEEE,2024:153-167.DOI:10.1109/MICRO61859.2024.00021

[12]RILEY G F,HENDERSON T R.The ns-3 network simulator[M]//Modeling and Tools for Network Simulation.Berlin,Heidelberg:Springer Berlin Heidelberg,2010:15-34.DOI:10.1007/978-3-642-12331-3_2

[13]SRIDHARAN S,HEO T,FENG L,et al.Chakra:advancing performance benchmarking and co-design using standardized execution traces[EB/OL].(2023-05-26)[2025-03-20].https://arxiv.org/abs/2305.14516

[14]SAMAJDAR A,ZHU Y,WHATMOUGH P N,et al.SCALE-sim:systolic CNN accelerator[EB/OL].[2025-03-20].http://arxiv.org/abs/1811.02883

[15]HUBERT B.Linux advanced routing & traffic control[EB/OL].[2025-03-20].https://www.kernel.org/doc/ols/2002/ols2002-pages-213-222.pdf

[16]MA T,LUO L,YU H F,et al.Klonet:an easy-to-use and scalable platform for computer networks education[EB/OL].[2025-03-20].https://www.usenix.org/conference/nsdi24/presentation/ma

[17]DUBEY A,JAUHRI A,PANDEY A,et al.The llama 3 herd of models[EB/OL].[2025-03-20].https://arxiv.org/abs/2407.21783

[18]YUAN A,ZHAO H,DU Z,et al.WuDaoCorpora:a super large-scale Chinese corpora for pre-training language models[EB/OL].[2025-03-20].https://www.sciencedirect.com/science/article/pii/S2666651021000152

[19]FENG W J,LI Z H,YU H F.Distributed inference technologies for large language models in low-resource clusters[J].ZTE Technology Journal,2024,30(2):43-49.DOI:10.12142/ZTETJ.202402007

Basic information:

CLC number: TP393.2

Citation:

[1]BIAN Y H,LIU M Y,YU H F.Design of a wide-area intelligent computing network simulation architecture for multi-computing-center collaboration[J].ZTE Technology Journal,2025,31(02):39-46.
