DNN Partitioning for Cooperative Inference in Edge Intelligence: Modeling, Solutions, Toolchains
With rapid advancements in artificial intelligence and Internet of Things technologies, deploying deep neural network models on edge nodes and end nodes has become an important trend. However, the limited computational power and storage capacity of these devices present significant challenges for deep learning inference. Traditional acceleration methods, such as model compression and hardware optimization, often struggle to balance real-time performance, accuracy, and cost-effectiveness. To address these challenges, collaborative inference through DNN partitioning has emerged as a promising solution. This article provides a comprehensive overview of architectural frameworks for DNN partitioning in collaborative inference. We establish a unified mathematical framework to describe various architectures, DNN models, and their associated optimization problems. In addition, we systematically classify and analyze existing partitioning strategies based on partition count and granularity. Furthermore, we summarize commonly used experimental setups and tools, offering practical insight into implementation. Finally, we discuss key challenges and open issues in DNN partitioning for collaborative inference, such as ensuring data security and privacy and efficiently partitioning large-scale models, providing guidance for future research.
One Introduction
With rapid advances in artificial intelligence and the Internet of Things, deep neural networks have been widely applied in fields such as image recognition, natural language processing, autonomous driving, and intelligent healthcare. Deep neural networks achieve high accuracy in classification, prediction, and decision-making tasks. However, increasing task complexity demands larger models with more parameters and deeper architectures, leading to greater computational and storage requirements. For example, AlexNet improved image recognition accuracy with eight layers and sixty million parameters; VGG-sixteen employed three by three convolutions and extended the network to sixteen layers and one hundred forty million parameters; GPT-three has one hundred seventy-five billion parameters, requiring over seven hundred gigabytes of memory and three thousand five hundred eighty teraflops per inference.
Despite improvements in accuracy and efficiency, deep neural networks' computational and storage demands continue to grow. As deep neural network applications expand, inference bottlenecks intensify, especially in resource-limited edge environments with constrained computation, storage, and energy. In real-time applications like autonomous driving and video analysis, the heavy computation often exceeds device capacities, causing latency and performance degradation. Therefore, optimizing inference efficiency while preserving accuracy is crucial. Various acceleration techniques have been proposed in academia and industry; these methods are summarized as follows:
One. Model Optimization: Various techniques have been proposed to reduce deep neural network complexity while maintaining accuracy, such as quantization, pruning, knowledge distillation, and early-exit mechanisms.
Two. Hardware Acceleration: Inference efficiency is further boosted through specialized hardware including graphics processing units, tensor processing units, field-programmable gate arrays, and application-specific integrated circuits.
Three. DNN Partitioning: Deep neural network partitioning reduces inference latency by exploiting two properties of the architecture: layers depend on one another in a fixed order, and intermediate outputs often shrink as data moves through the network. For instance, Tiny YOLOv-two's input is zero point ninety-five megabytes, while the output of its max five layer is only zero point zero eight megabytes, a ninety-three percent decrease. Transmitting this smaller intermediate output instead of the raw input reduces the data sent over the network and eases bandwidth requirements. Partitioning distributes layers or sub-layers across nodes, balancing computation and communication through pipelined execution. As shown in Figure one, Node one (resource-constrained, e.g., end or edge device) and Node two (more powerful, e.g., edge server or cloud) are linked via wired or wireless connections. The initial layers run on Node one, and the subsequent, heavier layers are offloaded to Node two. Intermediate feature maps are transmitted between nodes following the model's layerwise dependencies. Placing the partition where feature maps shrink significantly reduces transmission overhead without sacrificing accuracy, because the model itself is unchanged.
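The trade-off described in the third item can be illustrated with a minimal Python sketch that searches for the split point with the lowest end-to-end latency in a two-node chain. It assumes a purely sequential model, a fixed link bandwidth, and a negligible cost for returning the final result. Only the input size (zero point ninety-five megabytes) and the fifth block's output size (zero point zero eight megabytes) come from the text above; the names `Layer` and `best_partition`, the per-layer latencies, and the other output sizes are hypothetical values chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    t_node1_ms: float  # assumed latency of this layer on Node one (weak device)
    t_node2_ms: float  # assumed latency of this layer on Node two (strong device)
    out_mb: float      # size of this layer's output in megabytes


def best_partition(layers, input_mb, bandwidth_mbps):
    """Return (k, latency_ms) where layers[:k] run on Node one and
    layers[k:] run on Node two, minimizing compute plus transfer time."""
    best_k, best_latency = 0, float("inf")
    for k in range(len(layers) + 1):
        local = sum(layer.t_node1_ms for layer in layers[:k])
        remote = sum(layer.t_node2_ms for layer in layers[k:])
        if k == len(layers):
            sent_mb = 0.0                   # everything runs locally, nothing is sent
        elif k == 0:
            sent_mb = input_mb              # the raw input is offloaded
        else:
            sent_mb = layers[k - 1].out_mb  # the intermediate feature map is sent
        transfer = sent_mb * 8 / bandwidth_mbps * 1000  # MB -> megabits -> ms
        total = local + transfer + remote
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency


# Hypothetical profile: Node two is assumed ten times faster than Node one.
layers = [
    Layer("conv1+max1", 20.0, 2.0, 1.40),
    Layer("conv2+max2", 25.0, 2.5, 0.70),
    Layer("conv3+max3", 30.0, 3.0, 0.35),
    Layer("conv4+max4", 35.0, 3.5, 0.17),
    Layer("conv5+max5", 40.0, 4.0, 0.08),
    Layer("remaining layers", 200.0, 20.0, 0.01),
]

k, latency = best_partition(layers, input_mb=0.95, bandwidth_mbps=8.0)
print(f"run the first {k} blocks on Node one, estimated latency {latency:.1f} ms")
```

With these made-up numbers and a slow link, the search places the split after the fifth block, where the feature map is smallest relative to the remaining computation. With a much faster link, the same search offloads the raw input instead, which shows that the preferred partition point depends on bandwidth as well as on the layer profile.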
Although all three approaches aim to speed up neural network inference, they differ fundamentally. Model optimization reduces model complexity by modifying the network through pruning, quantization, or knowledge distillation, often requiring extensive retraining and risking performance loss in complex tasks. Such compressed models are typically platform- or dataset-specific, limiting generalizability. Hardware acceleration improves speed and energy efficiency via specialized hardware but faces deployment constraints in edge environments due to cost and power consumption. Conversely, deep neural network partitioning retains the original model architecture and optimizes execution by distributing computation and communication across nodes (cloud, edge, end devices) based on their capabilities, making it well-suited for distributed, resource-limited scenarios that demand accuracy preservation.