Distributed Inference with Deep Learning Models across Heterogeneous Edge Devices
Abstract-Recent years have witnessed increasing research interest in deploying deep learning models on edge devices for inference. Because edge devices have limited capabilities and power constraints, it may be necessary to distribute the inference workload across multiple devices. Existing mechanisms divide the model across edge devices under the assumption that deep learning models are constructed as a chain of layers. In reality, however, modern deep learning models are more complex, involving a directed acyclic graph rather than a chain of layers.
In this paper, we present EdgeFlow, a new distributed inference mechanism designed for general directed acyclic graph structured deep learning models. Specifically, EdgeFlow partitions model layers into independent execution units with a new progressive model partitioning algorithm. By producing near-optimal model partitions, our new algorithm seeks to improve the run-time performance of distributed inference as these partitions are distributed across the edge devices. During inference, EdgeFlow orchestrates the intermediate results flowing through these units to fulfill the complicated layer dependencies. We have implemented EdgeFlow based on PyTorch, and evaluated it with state-of-the-art deep learning models with different structures. The results show that EdgeFlow reduces the inference latency by up to 40.2% compared with other approaches, which demonstrates the effectiveness of our design.
I. INTRODUCTION
Deep learning models are used in a wide variety of tasks such as image recognition, video analysis, and natural language processing. They are typically deployed on remote cloud servers and require users to upload local data for inference, which incurs considerable overhead in the time needed to transfer large volumes of data over the Internet. An intuitive solution to reduce this overhead is to "offload" the inference tasks from the cloud server to the edge devices. Unfortunately, edge devices are typically resource-constrained, while the inference process is extremely computation-intensive. Directly using a deep learning model for inference on devices with limited computation power may result in an even longer inference time. For this reason, it is desirable to design distributed inference mechanisms that accelerate the inference process by partitioning the workload and distributing it across a cluster of edge devices for cooperative inference.
Though distributed inference has received much attention in the recent literature, existing works generally assume that deep learning models are constructed as a chain of sequentially executed layers. Unfortunately, such an assumption is too simplistic to hold for modern deep learning models: besides stacking deeper layers, structural modifications are also applied to the models to pursue the best performance, such as residual blocks in ResNet and PANet. Generally, modern models are constructed as directed acyclic graphs instead of chains to represent complicated layer dependencies, e.g., a layer may require outputs from multiple preceding layers, or its output may have to be fed into multiple subsequent layers.
Such a directed acyclic graph structure comes with new challenges in the distributed deployment of deep learning models. For example, directed acyclic graph structured models require new layers, like upsampling and concatenation, to maintain the consistency of the intermediate results, which are barely discussed in existing distributed inference mechanisms. The execution sequence of the layers is undetermined as there could be parallel paths in the computation graph. Also, the fact that one layer may depend on multiple preceding layers increases the complexity of model partitioning, since the partitioning of a specific layer should consider how its preceding layers are partitioned.
Unfortunately, existing works fail to provide adequate support for directed acyclic graph structured models. Most of them try to turn a directed acyclic graph back into a chain with one of two intuitive methods: ignoring the branches and manually fixing the dependencies; or finding cut points, i.e., vertices whose removal increases the number of connected components of the graph, such that the branchy layers between two cut points can be treated as a single layer. Though these methods may work well with simple directed acyclic graph structures like ResNet, they cannot deal with more complex models such as YOLOv5, a state-of-the-art deep learning model for object detection. YOLOv5 contains no cut points except the input and output layers, and it has so many branches that manually fixing the layer dependencies is almost impossible.
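To make the cut-point idea concrete, the following minimal Python sketch finds cut points by brute force: it removes each vertex in turn and checks whether the rest of the undirected graph stays connected. The function name `cut_points` and the two toy graphs are hypothetical illustrations, not taken from any existing system. The sketch uses the standard graph-theoretic definition, so input and output vertices that are leaves are not reported.

```python
from collections import deque

def cut_points(edges):
    """Return the vertices whose removal disconnects the undirected view of the graph."""
    nodes = sorted({v for edge in edges for v in edge})
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def connected_without(removed):
        # Breadth-first search over every vertex except `removed`.
        remaining = [v for v in nodes if v != removed]
        if not remaining:
            return True
        seen = {remaining[0]}
        queue = deque([remaining[0]])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w != removed and w not in seen:
                    seen.add(w)
                    queue.append(w)
        return len(seen) == len(remaining)

    return [v for v in nodes if not connected_without(v)]

# Toy residual-style graph (hypothetical): one skip connection from "stem" to "add".
residual = [("in", "stem"), ("stem", "conv1"), ("conv1", "conv2"),
            ("conv2", "add"), ("stem", "add"), ("add", "out")]

# Toy branchy graph (hypothetical): skip edges bypass every interior layer.
dense = [("in", "a"), ("in", "b"), ("a", "c"), ("b", "c"),
         ("a", "out"), ("c", "out"), ("b", "out")]

residual_cuts = cut_points(residual)
dense_cuts = cut_points(dense)
```

In the residual-style graph, the layers between `stem` and `add` could be merged into a single block, which restores a chain. In the branchy graph no interior vertex is a cut point, so this trick does not apply.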
To address these challenges, this paper introduces EdgeFlow, a new distributed inference mechanism specifically designed for general directed acyclic graph structured deep learning models. Rather than turning a directed acyclic graph into a chain, EdgeFlow breaks the layers of the computation graph into a set of execution units, each of which contains a list of input requirements, a computation operator, and a forwarding table. During inference, when the required input is ready, the execution unit applies the computation operator to the input to produce an intermediate result. According to the forwarding table, it then sends the intermediate result to other units to fulfill their input requirements. Intuitively, the intermediate results flow among execution units, which are distributed among edge devices, according to the layer dependencies, completing an inference that is logically equivalent to that of the original computation graph.
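The execution-unit abstraction described above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not EdgeFlow's implementation: the class name `ExecutionUnit`, its fields, the `run` loop, and the scalar toy operators are all hypothetical, and real units would exchange tensors over the network between devices.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ExecutionUnit:
    name: str
    requirements: List[str]          # senders whose results this unit waits for
    operator: Callable[..., float]   # computation applied once all inputs are ready
    forwarding: List[str]            # units that receive this unit's result
    inbox: Dict[str, float] = field(default_factory=dict)

    def ready(self) -> bool:
        return all(r in self.inbox for r in self.requirements)

    def fire(self) -> float:
        # Pass inputs to the operator in the order the requirements are listed.
        return self.operator(*(self.inbox[r] for r in self.requirements))

def run(units: Dict[str, ExecutionUnit], entry: str, value: float) -> Dict[str, float]:
    """Deliver messages until no unit has anything left to forward."""
    results: Dict[str, float] = {}
    messages = deque([("input", entry, value)])
    while messages:
        sender, receiver, payload = messages.popleft()
        unit = units[receiver]
        unit.inbox[sender] = payload
        if unit.ready() and unit.name not in results:
            results[unit.name] = unit.fire()
            for target in unit.forwarding:
                messages.append((unit.name, target, results[unit.name]))
    return results

# Toy graph with a skip connection: "add" needs results from both "double" and "inc".
units = {
    "double": ExecutionUnit("double", ["input"], lambda x: 2 * x, ["inc", "add"]),
    "inc": ExecutionUnit("inc", ["double"], lambda x: x + 1, ["add"]),
    "add": ExecutionUnit("add", ["double", "inc"], lambda a, b: a + b, []),
}
results = run(units, "double", 3.0)
```

Note that `add` receives the result of `double` first but does not fire until the result of `inc` also arrives, which is how the input requirements enforce the dependencies without a global execution order.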
EdgeFlow achieves acceleration by partitioning a layer into multiple independent execution units, such that they can be assigned to different devices for parallel execution. One of the major challenges, therefore, is how these layers should be partitioned. Since we have to decide the partitioning scheme for each layer, the decision space could be too large to find the optimum solution. In addition, the partitioning decision for a specific layer is related to how its preceding layers are partitioned, which further adds complexity. Considering both the run-time efficiency and partition optimality, we propose a new progressive algorithm that partitions the computation graph layer by layer in topological order, such that we can optimally partition the workload of a specific layer when the partition schemes for its preceding layers are fixed.
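The layer-by-layer traversal can be sketched as follows. This is a minimal sketch of the traversal order only: the per-layer decision `even_split` is a placeholder that ignores the predecessors' schemes, whereas an actual optimizer would use them to account for the data each device must fetch. The function names, the toy graph, and the row counts are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def even_split(rows, num_devices, fixed_predecessors):
    """Placeholder per-layer decision: near-equal row counts per device.

    `fixed_predecessors` maps each preceding layer to its already fixed scheme.
    It is unused here; a real optimizer would take it into account.
    """
    base, extra = divmod(rows, num_devices)
    return [base + (1 if i < extra else 0) for i in range(num_devices)]

def progressive_partition(predecessors, rows, num_devices, split_layer=even_split):
    """Fix one layer's scheme at a time, visiting layers in topological order."""
    schemes = {}
    for layer in TopologicalSorter(predecessors).static_order():
        # Every predecessor has already been visited, so its scheme is fixed.
        fixed = {p: schemes[p] for p in predecessors.get(layer, ())}
        schemes[layer] = split_layer(rows[layer], num_devices, fixed)
    return schemes

# Toy graph (hypothetical): two parallel paths from conv1 are merged by concat.
predecessors = {
    "conv1": ["input"],
    "pool": ["conv1"],
    "conv2": ["conv1"],
    "concat": ["pool", "conv2"],
}
rows = {"input": 64, "conv1": 64, "pool": 32, "conv2": 32, "concat": 32}
schemes = progressive_partition(predecessors, rows, num_devices=3)
```

Because the lookup `schemes[p]` would fail for an unvisited predecessor, the sketch also demonstrates that a topological order guarantees all preceding schemes are available when a layer is partitioned.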
Though such a partitioning problem can be formulated as an integer programming problem, this formulation may not be practical to solve, since integer programming is NP-hard in general. In this context, another highlight of our contributions is that we are able to transform the integer programming problem into a linear programming problem as an approximation, which can be solved to optimality efficiently with off-the-shelf linear programming solvers, without resorting to heuristics. We have implemented EdgeFlow with PyTorch and evaluated it on various deep learning model structures, including the latest YOLOv5 model. Our comparison results show that EdgeFlow outperforms the latest distributed inference works, reducing the inference latency by up to 40.2%.
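The idea of relaxing an integer program can be illustrated with a deliberately simplified toy model, which is an assumption of this sketch and not the formulation used by EdgeFlow: split `rows` output rows among devices to minimize the finishing time T, subject to cost_i * x_i <= T, sum(x_i) = rows, and x_i >= 0, while ignoring communication cost. With integer x_i this is an integer program; dropping integrality gives a linear program whose optimum, for this toy model, has a closed form, so no solver is needed here. All function names and numbers below are hypothetical.

```python
def relaxed_split(rows, cost_per_row):
    """Optimum of the relaxed (real-valued) toy problem.

    At the optimum every device finishes at the same time, so each share
    is proportional to the device's processing rate 1 / cost.
    """
    rates = [1.0 / c for c in cost_per_row]
    total_rate = sum(rates)
    return [rows * r / total_rate for r in rates]

def round_to_integers(fractional, rows):
    """Largest-remainder rounding so that the integer shares still sum to `rows`."""
    shares = [int(x) for x in fractional]
    remaining = rows - sum(shares)
    order = sorted(range(len(fractional)),
                   key=lambda i: fractional[i] - shares[i], reverse=True)
    for i in order[:remaining]:
        shares[i] += 1
    return shares

def makespan(shares, cost_per_row):
    """Finishing time of the slowest device for a given split."""
    return max(x * c for x, c in zip(shares, cost_per_row))

cost_per_row = [1.0, 2.0, 4.0]  # hypothetical per-row compute time on three devices
fractional = relaxed_split(100, cost_per_row)
integral = round_to_integers(fractional, 100)
relaxed_time = makespan(fractional, cost_per_row)
rounded_time = makespan(integral, cost_per_row)
```

The relaxed finishing time is a lower bound on the finishing time of any integer split, so the gap between `relaxed_time` and `rounded_time` indicates how much the approximation can cost in this toy setting.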