2024, No. 06, Vol. 30, pp. 39-47
Development and Application of Intelligent Computing Center Network Technologies
Abstract:

This paper introduces the key technologies of intelligent computing center networks in terms of a complete technology stack comprising the application sublayer, the network interface card (NIC) sublayer, the network sublayer, and the management and control sublayer. Building on an analysis of development trends in intelligent computing center networks, it describes a series of innovations that ZTE Corporation has made in chips, products, and networking solutions while adhering to the principle of in-house development of core technologies. The authors argue that optimization for artificial intelligence (AI) scenarios will become a key factor in the development of intelligent computing center networks, and that the industry must make further efforts in basic chips, device forms, network architecture, network protocols, and the application ecosystem to advance the converged development of key technologies on the computing side, the end side, and the network side.


Basic information:

CLC number: TP393.09; TP18

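Citation:

[1] DUAN Wei, LI Hesong, ZHOU Kun. Development and application of intelligent computing center network technologies[J]. ZTE Technology Journal (中兴通讯技术), 2024, 30(06): 39-47.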
