15% Faster AI Training: A Comprehensive Network View with End-Point Decoupled Full-Flow Scheduling
Imagine a city at rush hour, gridlocked by rigid traffic lights that ignore the most basic traffic patterns.
In the morning, all traffic surges toward the city center and the main roads show deep red on the map; in the evening, the flow pours back out en masse. A fixed timing scheme drives road efficiency toward zero.

Right now, a digital version of this traffic paralysis is playing out in a large model training cluster. The monitoring screen shows that of the 8×400G links, 3 are deep red and congested, 2 are yellow and slow, and the other 3 sit nearly idle. Model iteration time has consequently grown from 4.2 seconds to 6 seconds. Predictable AI traffic has been choked to a standstill by mechanical scheduling.
The network traffic generated by large model training, especially All-Reduce, resembles regular "tidal traffic flow": highly patterned, highly predictable, and characterized by large per-flow bandwidth. The ECMP (Equal-Cost Multi-Path) scheduling of traditional networks, however, acts like a rigid traffic light system that recognizes only license plate numbers (the five-tuple) and understands nothing of tidal patterns. This structural conflict makes data collisions and queuing at critical intersections (hotspot links) inevitable.
Traditional networks are built on the ideal assumption of "random traffic, balanced hashing," but the reality of AI training is "patterned traffic, hash polarization." When deterministic business demands meet infrastructure scheduling that is rigid and unaware of traffic intent, congestion and performance bottlenecks on critical paths become inevitable. Compounding the issue, traditional sampling-based monitoring has inherent flaws: incomplete collection, inaccurate reconstruction, and delayed analysis. Operations staff are left navigating in the dark, unable to grasp the real-time dynamics of traffic across the whole network.
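The hash polarization described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the hashing used by any real switch: MD5 stands in for a vendor hash function, and the addresses, the 200G per-flow demand, and the helper names (ecmp_link, make_flows, link_load) are hypothetical. It compares a five-tuple-only choice with an idealized even spread over 8 links.

```python
import hashlib
from collections import Counter

NUM_LINKS = 8      # mirrors the 8 x 400G links in the scenario above
FLOW_GBPS = 200    # hypothetical demand of each large training flow


def ecmp_link(five_tuple, num_links=NUM_LINKS):
    """Pick a link from a hash of the five-tuple only, with no load awareness.

    MD5 is an assumption used for illustration; real switches use their own hash.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def make_flows(count):
    """Hypothetical RoCEv2 flows: UDP (protocol 17) to destination port 4791."""
    return [
        (f"10.0.0.{i + 1}", f"10.0.1.{i + 1}", 49152 + i, 4791, 17)
        for i in range(count)
    ]


def link_load(flows, chooser):
    """Sum the offered load per link for a given link-selection function."""
    load = Counter({link: 0 for link in range(NUM_LINKS)})
    for index, flow in enumerate(flows):
        load[chooser(index, flow)] += FLOW_GBPS
    return load


flows = make_flows(16)
hashed = link_load(flows, lambda index, flow: ecmp_link(flow))
planned = link_load(flows, lambda index, flow: index % NUM_LINKS)  # idealized spread

for link in range(NUM_LINKS):
    print(f"link {link}: hashed={hashed[link]:>4}G  planned={planned[link]:>4}G")
```

With only a handful of large flows, the hashed column can never be more even than the planned one, and it is usually less even: some links receive several flows while others receive none.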
Breakthrough: End-Point Decoupled Full-Flow Path Navigation Solution
To solve this problem, the network must evolve from "blind command" to "intelligent scheduling." To this end, we combine Intelligent Computing Switches (the vehicles) with the AD-DC Intelligent Computing Edition (the intelligent scheduling system) into a holistic end-point decoupled full-flow path navigation solution. A closed-loop technology system of "See Fully, See Accurately, Understand Clearly, Optimize Well" delivers a fundamental transformation.

See Fully: Eliminating Sampling Blind Spots, Full-Flow Capture
Traditional sampling-based monitoring cannot keep up with massive, high-speed RoCE (RDMA over Converged Ethernet) data flows. Using Telemetry Stream technology, the switches mirror the features of all traffic. The AD-DC Intelligent Computing collector aggregates and analyzes this data and builds a comprehensive "digital profile" for every data flow, giving a real-time panoramic view of traffic across the entire network.

See Accurately: From Data to Insight, Precise Quantitative Analysis
Building on the full data, the AD-DC Intelligent Computing analyzer conducts session-level deep analysis. It not only fully reconstructs the path and timing of each flow but also quantifies core performance metrics such as flow completion time, effective throughput, and retransmission rate. By combining latency measurements with precise path tracing, it builds a clear, measurable, and actionable panoramic performance view for the upstream intelligent scheduling system.
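The three metrics named above follow directly from per-flow records. The sketch below is an illustration only: the FlowRecord fields and the sample values are hypothetical and do not reflect the analyzer's actual data model.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """Hypothetical per-flow record assembled from full-flow telemetry."""
    flow_id: str
    start_s: float              # time the first packet was seen
    end_s: float                # time the last packet was seen
    payload_bytes: int          # application bytes delivered
    packets_sent: int
    packets_retransmitted: int


def flow_completion_time(record):
    """Seconds between the first and last packet of the flow."""
    return record.end_s - record.start_s


def effective_throughput_gbps(record):
    """Delivered payload divided by completion time, in Gbit/s."""
    duration = flow_completion_time(record)
    if duration <= 0:
        raise ValueError("flow duration must be positive")
    return record.payload_bytes * 8 / duration / 1e9


def retransmission_rate(record):
    """Fraction of sent packets that were retransmissions."""
    if record.packets_sent == 0:
        return 0.0
    return record.packets_retransmitted / record.packets_sent


sample = FlowRecord("allreduce-0", 0.0, 0.5, 12_500_000_000, 4_000_000, 2_000)
print(f"FCT: {flow_completion_time(sample):.3f} s")
print(f"Throughput: {effective_throughput_gbps(sample):.1f} Gbit/s")
print(f"Retransmission rate: {retransmission_rate(sample):.4%}")
```

Keeping each metric as a separate function makes it straightforward to aggregate per link or per time window later.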
Understand Clearly: Timeslice Modeling, Comprehending Business Intent
Traditional switch ECMP hashing is a "local perspective" decision: each device can make only locally optimal choices based on its own limited information and cannot foresee how traffic will eventually aggregate and collide across the entire network. In contrast, AD-DC Intelligent Computing timeslice modeling provides the global insight of a comprehensive view.
This is the core of intelligent scheduling. The system uses timeslice modeling to understand how traffic flows relate to one another in time and space:

Identify: Precisely distinguish between parallel flows that "simultaneously compete for links" and serial flows that "can be shared via staggered scheduling."
Predict: Like an actuary, quantitatively calculate the future idle time window of each link and foresee the risk of traffic collisions.
This means the system can answer not only "where the traffic is now," but also "where all traffic flows are headed and whether they will meet." This is a fundamental leap from passive monitoring to active awareness and control.
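The Identify and Predict steps above can be sketched as interval arithmetic over planned link occupancy. This is a simplified illustration, not the product's timeslice model: the PlannedFlow type, the millisecond timings, and the assumption that every flow lies within the planning horizon are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PlannedFlow:
    """Hypothetical occupancy of one link by one flow, as [start_ms, end_ms)."""
    name: str
    start_ms: int
    end_ms: int


def overlaps(a, b):
    """True when two flows occupy the link during a common time slice."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def classify_pairs(flows):
    """Identify: split flow pairs into parallel (compete) and serial (can share)."""
    parallel, serial = [], []
    for a, b in combinations(flows, 2):
        (parallel if overlaps(a, b) else serial).append((a.name, b.name))
    return parallel, serial


def idle_windows(flows, horizon_ms):
    """Predict: list the [start, end) gaps in which the link carries no flow.

    Assumes every flow starts and ends within the horizon.
    """
    windows, cursor = [], 0
    for flow in sorted(flows, key=lambda f: f.start_ms):
        if flow.start_ms > cursor:
            windows.append((cursor, flow.start_ms))
        cursor = max(cursor, flow.end_ms)
    if cursor < horizon_ms:
        windows.append((cursor, horizon_ms))
    return windows


flows = [PlannedFlow("A", 0, 40), PlannedFlow("B", 20, 60), PlannedFlow("C", 70, 100)]
parallel, serial = classify_pairs(flows)
print("parallel pairs:", parallel)
print("serial pairs:", serial)
print("idle windows:", idle_windows(flows, horizon_ms=120))
```

In this example A and B overlap and would compete for the link, while C can reuse it afterwards; the idle windows show where another flow could be scheduled without collision.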
Optimize Well: Comprehensive Dynamic Path Selection, Achieving Transparent Scheduling

Based on precise "understanding," the system performs dynamic intelligent path selection:
Dynamic Scheduling: For parallel flows "prone to collision," steers them in real time to the least-loaded paths; for serial flows that "can be staggered," intelligently arranges shared path usage.
Global Optimum: Each path possesses a real-time "dynamic weight," constantly reflecting its busyness level, ensuring every decision is based on the most authentic state of the entire network.
This entire intelligent scheduling process is automatically completed within an autonomous "Sense-Analyze-Decide-Execute" closed loop, ensuring continuously optimized results.
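The idea of a per-path "dynamic weight" can be sketched as a greedy placement rule. This is an assumption-laden illustration rather than the scheduler's actual algorithm: the weight formula (free fraction of capacity), the Path type, and the path names and loads are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical candidate path with its capacity and current load."""
    name: str
    capacity_gbps: float
    load_gbps: float = 0.0

    @property
    def weight(self):
        """Assumed dynamic weight: the fraction of capacity that is still free."""
        return 1.0 - self.load_gbps / self.capacity_gbps


def place_flow(paths, demand_gbps):
    """Send a flow to the path with the highest weight, then update its load."""
    best = max(paths, key=lambda path: path.weight)
    best.load_gbps += demand_gbps
    return best.name


paths = [
    Path("spine-1", 400, 300),
    Path("spine-2", 400, 100),
    Path("spine-3", 400, 0),
]

# Place three parallel 200G flows one after another; weights change after each.
placements = [place_flow(paths, 200) for _ in range(3)]
print(placements)
for path in paths:
    print(f"{path.name}: load={path.load_gbps:.0f}G weight={path.weight:.2f}")
```

Because the load is updated after every placement, each decision sees the effect of the previous one, which is the property the "dynamic weight" description relies on.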
Advantage: End-Point Decoupling, Enabling Intelligent, Transparent Implementation
The core competitiveness of this solution lies in its "end-point decoupling" architecture: all intelligent scheduling logic is executed independently on the network side, achieving smooth, "zero-intrusion, transparent" implementation for computing services. Its value is manifested in four aspects:
Broad Compatibility: Independent of any specific hardware (GPU/NIC) or AI framework; no modifications required to drivers, business code, or applications.
Deployment Transparency: Network-side configuration completed within minutes; no need to restart training tasks or interrupt services; smooth go-live.
Scheduling Transparency: Guides traffic by dynamically adjusting standard routing policies, completely transparent to servers and applications.
Closed-Loop Self-Optimization: Forms an autonomous "Sense-Decide-Execute-Verify" loop based on real-time data, continuously optimizing scheduling effectiveness.
Validation: Visible Efficiency Improvement
Data from actual deployments shows that this solution fundamentally improves AI training networks: effective network bandwidth utilization rose from under 60% to over 95%, and average iteration time fell by 15%, all with zero intrusion on business operations.
This means that, with the "End-Point Decoupled Full-Flow Navigation" solution, the "network congestion" caused by hash collisions in AI training is systematically resolved. When the network can schedule itself autonomously, customers gain faster model iteration, more stable cluster operation, and higher returns on their computing investments.
