Adaptive and Resilient Model-Distributed Inference in Edge Computing Systems
ABSTRACT The traditional approach to distributed deep neural network (DNN) inference in edge computing systems is data-distributed inference. In this paradigm, each worker holds a pre-trained DNN model and uses it to process the data offloaded to it. Data-distributed inference has a high communication cost, especially when the data size is large, and it is memory-inefficient because every worker must store and compute the whole model. Model-distributed inference, where a DNN model is distributed across workers, is emerging as a promising solution. Although there is a large body of work on model-distributed training, the benefit of model distribution for inference is not well understood. In this paper, we analyze the potential of model-distributed inference in edge computing systems. Then, we develop an Adaptive and Resilient Model-Distributed Inference (AR-MDI) algorithm based on our optimal model allocation formulation. AR-MDI performs model allocation in a lightweight and decentralized way, and it is resilient against delayed workers and failures. We implement AR-MDI in a real testbed consisting of NVIDIA Jetson TX2s and show that AR-MDI significantly improves the inference time compared to baselines when the size of data is large, as in ImageNet.
I. INTRODUCTION
Modern edge devices such as drones, autonomous robots, sensors, and self-driving cars generate data at tremendous rates. Many applications that run on these devices are delay-sensitive, meaning that the data they generate should be processed as quickly as possible. For such applications, transmitting the generated data to a remote cloud may be unacceptable due to transmission delays. Thus, data should be processed near its place of origin, i.e., on or near the edge. One complication in this context is that edge devices are typically limited in terms of computation power, energy, and/or memory. Hence, the design of high-performance distributed data processing methods is crucial.
The traditional approach to distributed deep neural network (DNN) inference is data-distributed inference, which partitions and distributes data to workers as illustrated in Fig. 1. The workers comprise edge servers, end users, and/or a remote cloud if available. An end user that would like to classify input data offloads the data to workers for classification. The end user itself can act as one of the workers by processing some of its own data. Each worker keeps a pre-trained DNN model, processes the offloaded data, and sends the output back to the end user. This approach, although straightforward, has two disadvantages: (i) the communication cost is high, especially when the input data size is large, e.g., high-resolution data; and (ii) each worker must store the whole model, which puts a strain on end-user devices in particular.
Model-distributed inference (also called model parallelism), where a DNN model is distributed across workers, is emerging as a promising solution (Fig. 2). The end user, which has the input data, may process the first few layers of the DNN model and transmit the activation vector of its last layer to a neighboring node. The neighboring node receives the activation vector and performs the calculations of the layers assigned to it. Finally, the worker that calculates the last layers of the DNN model obtains the output and sends it back to the end user that has the input data. We note that the workers process data in parallel in this setup through pipelining, as further explained in Section III.
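As a rough illustration of the layer-wise hand-off described above, the following Python sketch splits a toy model into contiguous chunks of layers, one chunk per worker, and passes the activation vector from worker to worker. All names (`make_scale_layer`, `partition_layers`, `run_worker`, `model_distributed_inference`, `pipelined_makespan`) are hypothetical, the "layers" are simple stand-ins for real DNN layers, and the near-equal split is an assumption made only for illustration; it is not the adaptive allocation of AR-MDI. The makespan helper further assumes equal per-stage processing times and negligible transmission delay.

```python
from typing import Callable, List, Sequence

# A "layer" here is any function mapping an activation vector to a new one.
Layer = Callable[[List[float]], List[float]]


def make_scale_layer(factor: float) -> Layer:
    """Hypothetical stand-in for a DNN layer: scales every activation."""
    return lambda activation: [factor * x for x in activation]


def partition_layers(layers: Sequence[Layer], num_workers: int) -> List[List[Layer]]:
    """Split layers into contiguous, near-equal chunks (assumed policy)."""
    base, extra = divmod(len(layers), num_workers)
    chunks: List[List[Layer]] = []
    start = 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(list(layers[start:start + size]))
        start += size
    return chunks


def run_worker(chunk: Sequence[Layer], activation: List[float]) -> List[float]:
    """One worker applies its assigned layers to the incoming activation."""
    for layer in chunk:
        activation = layer(activation)
    return activation


def model_distributed_inference(chunks: Sequence[Sequence[Layer]],
                                data: List[float]) -> List[float]:
    """Pass the activation from worker to worker; the last one yields the output."""
    activation = data
    for chunk in chunks:
        activation = run_worker(chunk, activation)
    return activation


def pipelined_makespan(num_inputs: int, num_workers: int, stage_time: float) -> float:
    """Idealized pipeline completion time: equal stages, no transmission delay."""
    return (num_inputs + num_workers - 1) * stage_time


if __name__ == "__main__":
    layers = [make_scale_layer(2.0) for _ in range(5)]
    chunks = partition_layers(layers, num_workers=3)
    print([len(c) for c in chunks])
    print(model_distributed_inference(chunks, [1.0, 0.5]))
    print(pipelined_makespan(num_inputs=10, num_workers=3, stage_time=1.0))
```

Under these idealized assumptions, once the pipeline is full every worker is busy with a different input, which is why the completion time grows with the number of inputs plus the pipeline depth rather than with their product.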
Although there is a large body of work on model-distributed training, the benefit of model distribution for inference is not well understood. The potential of model distribution for training is clear. In data-distributed training, the whole model must be exchanged between the workers and a model aggregator (parameter server) for every batch of data, which incurs a very high communication cost. In contrast, because the DNN model itself is distributed, model-distributed training only requires exchanging activation vectors among workers, not the whole model. Thus, model-distributed training reduces the communication cost compared to data-distributed training.
The potential of model-distributed inference for reducing the communication cost is less obvious. While data-distributed inference requires the exchange of the actual data (Fig. 1), model-distributed inference requires the exchange of activation vectors. We observed that when the size of the data is large, exchanging the actual data incurs a higher communication cost, which makes model-distributed inference attractive. Building on this observation, we analyze the potential of model-distributed inference compared to data-distributed inference in a homogeneous setup, where all workers have the same amount of computing power.
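The trade-off above can be made concrete with a small back-of-the-envelope calculation. The Python sketch below compares the bytes sent per input item in the two paradigms; the function names, the 4-byte element size, and in particular the activation sizes at the cut points are illustrative assumptions, not measurements from this paper. The 224x224x3 and 32x32x3 shapes are used only as typical large and small image inputs.

```python
from typing import Sequence


def bytes_data_distributed(input_elements: int, bytes_per_element: int = 4) -> int:
    """Bytes to offload one raw input item to a worker (assumed float32)."""
    return input_elements * bytes_per_element


def bytes_model_distributed(cut_activation_elements: Sequence[int],
                            bytes_per_element: int = 4) -> int:
    """Bytes to forward one item's activations across every cut between workers."""
    return sum(cut_activation_elements) * bytes_per_element


if __name__ == "__main__":
    # Hypothetical activation sizes at two cut points (three workers).
    cuts = [4096, 4096]

    large_input = 224 * 224 * 3   # high-resolution image, as an example
    small_input = 32 * 32 * 3     # low-resolution image, as an example

    print(bytes_data_distributed(large_input), bytes_model_distributed(cuts))
    print(bytes_data_distributed(small_input), bytes_model_distributed(cuts))
```

With these assumed numbers, forwarding activations is cheaper than shipping the large input but more expensive than shipping the small one, which matches the qualitative observation that model distribution pays off when the input data is large. Real cut-point sizes depend on the model and on where it is split.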
It is crucial to exploit the potential of model-distributed inference in a heterogeneous and dynamic setup, where the computing power of workers may differ and change over time. A model partitioning mechanism based on dynamic programming has been proposed for this purpose. However, that approach incurs a high computing cost to determine the optimal model allocation, and it does not adapt to time-varying resources. Instead, we design a lightweight, adaptive, and decentralized model allocation mechanism, which we name Adaptive and Resilient Model-Distributed Inference (AR-MDI), based on the solution to our optimal model allocation formulation.
One weakness of model distribution compared to data distribution is its vulnerability to failing workers. For example, if one of the workers in Fig. 2 fails, the whole system fails. Thus, we design a recovery mechanism as part of our AR-MDI algorithm. The recovery mechanism of AR-MDI is inspired by the peer management mechanism of Chord-like P2P systems, as further detailed in Section V. The following are the key contributions of this work:
We provide inference time analysis for both model- and data-distributed inference in a homogeneous setup, and show that model-distributed inference has smaller inference time if the size of input data is large.
We formulate a model allocation problem across workers for model-distributed inference in a heterogeneous setup. Building on the solution to this optimization problem, we design a lightweight, adaptive, and decentralized model allocation algorithm, which we name the Adaptive and Resilient Model-Distributed Inference (AR-MDI) algorithm.
We fortify our AR-MDI algorithm with a recovery mechanism against delayed and failing workers.
We implemented AR-MDI as well as two baselines, EdgePipe and data-distributed inference, in a heterogeneous testbed of NVIDIA Jetson TX2 cards. Our experiments, which include the CIFAR-10 and ImageNet datasets and the VGG16 and MobileNetV2 DNN models, show that AR-MDI significantly reduces the inference time compared to the baselines.
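To give a flavor of what recovery against a failed worker can look like in a chain of workers, the following Python sketch shows a generic successor-skipping rule of the kind used in Chord-like P2P systems: a worker forwards to the first worker after it on a logical ring that is not known to have failed. This is only an assumed illustration with hypothetical names (`next_alive_successor`, `ring`, `failed`); it is not the AR-MDI recovery mechanism itself, and it leaves out how the layers of the failed worker would be reassigned.

```python
from typing import List, Optional, Set


def next_alive_successor(worker: int, ring: List[int],
                         failed: Set[int]) -> Optional[int]:
    """Return the first non-failed worker after `worker` on the ring, or None.

    `ring` lists worker IDs in pipeline order; the search wraps around, so the
    last worker's successor is the first one (assumed ring topology).
    """
    n = len(ring)
    start = ring.index(worker)
    for step in range(1, n):
        candidate = ring[(start + step) % n]
        if candidate not in failed:
            return candidate
    return None  # every other worker is marked as failed


if __name__ == "__main__":
    ring = [0, 1, 2, 3]
    # Worker 1 is assumed to have failed, so worker 0 skips ahead to worker 2.
    print(next_alive_successor(0, ring, failed={1}))
```

In a model-distributed pipeline, skipping a worker is only safe if the layers it was responsible for are also taken over by another worker, so a rule like this would have to be combined with a reallocation step.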
The rest of this paper is structured as follows. Section II presents the related work. Section III introduces our system model and provides preliminaries on model-distributed inference. Section IV analyzes the potential of model-distributed inference for the case of homogeneous transmission delays and worker computing powers. In Section V, we formulate an optimal model allocation problem for the heterogeneous setup and design our Adaptive and Resilient Model-Distributed Inference (AR-MDI) algorithm based on the structure of the optimal solution. In Section VI, we provide experimental results on a real-life testbed. Section VII concludes the paper.