DNN Partitioning for Cooperative Inference in Edge Intelligence: Modeling, Solutions, Toolchains

With rapid advancements in artificial intelligence and Internet of Things technologies, deploying deep neural network models on edge nodes and end nodes has become an essential trend. However, the limited computational power and storage capacity of these devices, together with their other resource constraints, present significant challenges for deep learning inference. Traditional acceleration methods, such as model compression and hardware optimization, often struggle to balance real-time performance, accuracy, and cost-effectiveness. To address these challenges, collaborative inference through DNN partitioning has emerged as a promising solution. This article provides a comprehensive overview of architectural frameworks for DNN partitioning in collaborative inference. We establish a unified mathematical framework to describe various architectures, DNN models, and their associated optimization problems. In addition, we systematically classify and analyze existing partitioning strategies based on partition count and granularity. Furthermore, we summarize commonly used experimental setups and tools, offering practical insight into implementation. Finally, we discuss key challenges and open issues in DNN partitioning for collaborative inference, such as ensuring data security and privacy and efficiently partitioning large-scale models, providing valuable guidance for future research.

One. Introduction

With rapid advances in artificial intelligence and the Internet of Things, deep neural networks have been widely applied in fields such as image recognition, natural language processing, autonomous driving, and intelligent healthcare. Deep neural networks excel in classification, prediction, and decision-making due to their high accuracy. However, increasing task complexity demands larger models with more parameters and deeper architectures, leading to greater computational and storage requirements. For example, AlexNet achieved improved image recognition accuracy with eight layers and sixty million parameters; VGG-sixteen employed three by three convolutions, extending to sixteen layers and one hundred forty million parameters; GPT-three has one hundred seventy-five billion parameters, requiring over seven hundred gigabytes of memory and three thousand five hundred eighty teraflops per inference.
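The memory figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes every parameter is stored as a four-byte (thirty-two-bit) floating-point value, an assumption not stated in the text, and it counts weights alone, ignoring activations and runtime buffers. The helper name `param_memory_gb` is hypothetical.

```python
def param_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes).

    bytes_per_param=4 assumes 32-bit floats; this is an assumption,
    not something the article specifies.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the text above.
models = {
    "AlexNet": 60_000_000,
    "VGG-16": 140_000_000,
    "GPT-3": 175_000_000_000,
}

for name, count in models.items():
    print(f"{name}: {param_memory_gb(count):.2f} GB of weights")
```

Under this assumption the GPT-three weights alone come to seven hundred gigabytes, which is consistent with the "over seven hundred gigabytes" quoted above once any additional working memory is included.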

Despite improvements in accuracy and efficiency, deep neural networks' computational and storage demands continue to grow. As deep neural network applications expand, inference bottlenecks intensify, especially in resource-limited edge environments with constrained computation, storage, and energy. In real-time applications like autonomous driving and video analysis, the heavy computation often exceeds device capacities, causing latency and performance degradation. Therefore, optimizing inference efficiency while preserving accuracy is crucial. Various acceleration techniques have been proposed in academia and industry; these methods are summarized as follows:

One. Model Optimization: Various techniques have been proposed to reduce deep neural network complexity while maintaining accuracy, such as quantization, pruning, knowledge distillation, and early-exit mechanisms.

Two. Hardware Acceleration: Inference efficiency is further boosted through specialized hardware including graphics processing units, tensor processing units, field-programmable gate arrays, and application-specific integrated circuits.

Three. DNN Partitioning: Deep neural network partitioning reduces inference latency by leveraging the architecture's inter-layer connectivity and output size reduction. For instance, Tiny YOLOv-two's input is zero point ninety-five megabytes, while its max five layer output is only zero point zero eight megabytes, a ninety-three percent decrease. This size reduction minimizes data transmission and eases network bandwidth requirements. Partitioning distributes layers or sub-layers across nodes, balancing computation and communication through pipelined execution. As shown in Figure one, Node one (resource-constrained, e.g., end or edge device) and Node two (more powerful, e.g., edge server or cloud) are linked via wired or wireless connections. The initial layers run on Node one, with subsequent, heavier layers offloaded to Node two. Intermediate feature maps are transmitted between nodes following the model's layerwise dependencies. By placing the partition where feature maps shrink significantly, transmission overhead is minimized without sacrificing accuracy, effectively balancing computation and communication costs.
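The trade-off described in the partitioning item can be sketched as a search over a single cut point in a chain of layers. This is a minimal illustration under stated assumptions, not the survey's formulation: the `Layer` fields, the function `best_partition`, and all timings are hypothetical, latency is modeled as Node one compute plus transfer plus Node two compute, and no transfer is charged when every layer runs on Node one. Only the zero point ninety-five megabyte input and the zero point zero eight megabyte pooled output echo the figures quoted above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    name: str
    node1_ms: float   # hypothetical run time on the constrained Node 1
    node2_ms: float   # hypothetical run time on the more powerful Node 2
    output_mb: float  # size of the layer's output feature map in megabytes


def best_partition(layers: List[Layer], input_mb: float,
                   bandwidth_mbps: float) -> Tuple[int, float]:
    """Return (k, latency_ms): the first k layers run on Node 1, the rest on Node 2."""
    best_k, best_latency = 0, float("inf")
    for k in range(len(layers) + 1):
        node1 = sum(layer.node1_ms for layer in layers[:k])
        node2 = sum(layer.node2_ms for layer in layers[k:])
        if k == len(layers):
            transfer = 0.0  # assumption: a fully local run sends nothing
        else:
            # k == 0 ships the raw input; otherwise ship layer k-1's output.
            sent_mb = input_mb if k == 0 else layers[k - 1].output_mb
            transfer = sent_mb * 8.0 / bandwidth_mbps * 1000.0  # MB -> Mb -> ms
        total = node1 + transfer + node2
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency


# Made-up profile for illustration; these are not measurements of any real model.
layers = [
    Layer("conv1", 10.0, 1.0, 1.50),
    Layer("pool1", 2.0, 0.5, 0.40),
    Layer("conv2", 15.0, 1.5, 0.30),
    Layer("pool2", 2.0, 0.5, 0.08),
    Layer("conv3", 60.0, 3.0, 0.10),
    Layer("fc", 40.0, 2.0, 0.001),
]

k, latency = best_partition(layers, input_mb=0.95, bandwidth_mbps=8.0)
print(f"cut after {k} layers, estimated latency {latency:.1f} ms")
```

With these made-up numbers the search selects the cut after the fourth layer, where the feature map has shrunk to zero point zero eight megabytes while the heaviest layers still lie ahead. A real deployment would replace the hypothetical timings with profiled per-layer measurements and a measured link bandwidth.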

While both model optimization and deep neural network partitioning target neural network processing, they differ fundamentally. Model optimization reduces model complexity by modifying the network structure through pruning, quantization, or knowledge distillation, often requiring extensive retraining and risking performance loss in complex tasks. Such compressed models are typically platform- or dataset-specific, limiting generalizability. Hardware acceleration improves speed and energy efficiency via specialized hardware but faces deployment constraints in edge environments due to cost and power. Conversely, deep neural network partitioning retains the original model architecture, optimizing execution by distributing computation and communication across nodes (cloud, edge, end devices) based on their capabilities, making it well-suited for distributed, resource-limited scenarios that demand accuracy preservation.

One point one. Comparison and Our Contribution

One point two. The Organization of the Survey

Two. DNN Collaborative Inference: Model Structure, Scenario Architecture, and Optimization Issues

Two point two. Collaborative Inference Architecture in Edge Computing

Two point three. Optimization Problems

Three. Method

Three point one. One-Dimensional Solution Space: Single-Layer Solution Space

Three point two. Two-Dimensional Solution Space: Single-Layer and Model Optimization Solution Space

Three point three. Two-Dimensional Solution Space: Single-Layer and Resource Allocation Solution Space

Three point four. Two-Dimensional Solution Space: Multiple-Layer/Sub-Layer and Offloading Decision Solution Space

Three point six. One-Dimensional Solution Space: Tensor Parallelism Based on DNN Sub-Layers

Four. Experimental Setup and Commonly Used Tools

Four point one. Summary of DNNs, Associated Tools, and Datasets

Four point two. Summary of Computing Nodes and Communication Tools

Four point three. Summary of Resource Control Tools

Four point four. Summary of Parameter Analysis and Measurement Tools

Five. Challenges and Open Issues

Five point two. Robustness

Five point three. Partitioning of Large-Scale Models

Six. Conclusion

Overview

The article provides an in-depth overview of DNN partitioning for collaborative inference in edge intelligence, addressing challenges in deployment due to limited computational resources. It reviews existing approaches and proposes a unified framework for optimizing collaborative inference strategies.

Key Points

1. DNN partitioning can improve efficiency in edge intelligence applications.
2. The study presents a unified mathematical framework for various DNN architectures.
3. It classifies existing partitioning strategies according to granularity and count.
4. Key challenges include data security, privacy, and model partitioning efficiency.
5. The work also reviews Transformer-specific partitioning methods and their unique considerations.

Details

Authors: Yuntao Hao, Nan Ding, Weiguo Xia, Hongwei Ge, Li Xu
Category: Technology and Engineering
