Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI


Abstract: Inference over large-scale foundation models in heterogeneous edge environments requires a fundamentally reconfigurable orchestration substrate. Static partitioning of model layers presumes that compute and network resources remain stable over time, an assumption that does not hold in the volatile conditions of real-world deployments. We introduce a framework in which both the spatial placement and the internal segmentation of foundation models are resolved at runtime. The orchestration problem is formalized as a constrained optimization over layer-wise assignments, subject to evolving latency, utilization, and privacy constraints. By integrating model-aware capacity profiling with dynamic graph re-partitioning and reallocation, the framework composes inference reactively in response to infrastructural fluctuations. We present the architectural and algorithmic components, along with a representative use case in 6G multi-access edge computing.

I. INTRODUCTION

Next-generation 6G-enabled networks will need to support a multitude of AI services based on large foundation models (LFMs), such as transformer-based large language models. However, inference with these models requires significant compute, which makes their adoption challenging in edge environments. Distributed split inference has emerged as a promising approach to alleviate this computational burden. It partitions an LFM into multiple segments that are executed sequentially across different nodes, e.g., some segments run locally while heavier ones are offloaded. However, such model splits are mostly static and fixed before execution, so they cannot adapt to dynamic and heterogeneous operating conditions such as fluctuating network reliability or changing node utilization. Consequently, these approaches lead to suboptimal performance, compromising latency, resource utilization, and quality-of-service guarantees, especially in mission-critical applications. The problem becomes even more acute in resource-constrained and heterogeneous edge environments, where multiple users share resources such as multi-access edge compute, and where data privacy regulations often restrict offloading computations to the cloud. Such volatility renders any a priori static split untenable.
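To make the cost of a static split concrete, the following minimal sketch evaluates a single cut between a device and one edge node under two link conditions. It is an illustration only, not the paper's cost model: the `Link` fields, the per-block timings, and the activation sizes are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Link:
    bandwidth_mbps: float  # hypothetical device-to-edge bandwidth
    rtt_ms: float          # hypothetical round-trip time


def split_latency_ms(device_ms, edge_ms, act_mbit, split, link):
    """Latency when blocks [0, split) run on the device and the rest on the edge.

    act_mbit[k] is the size (megabits) of the tensor crossing the link when
    the cut sits just before block k; act_mbit[0] is the raw input.
    """
    local = sum(device_ms[:split])
    if split == len(device_ms):
        return local  # fully local execution, nothing crosses the link
    transfer = act_mbit[split] / link.bandwidth_mbps * 1000.0 + link.rtt_ms
    return local + transfer + sum(edge_ms[split:])


def best_split(device_ms, edge_ms, act_mbit, link):
    """Exhaustively pick the cut with the lowest estimated latency."""
    cuts = range(len(device_ms) + 1)
    return min(cuts, key=lambda s: split_latency_ms(device_ms, edge_ms, act_mbit, s, link))


# Illustrative numbers only: four blocks, slow device, fast edge node.
device_ms = [40, 40, 40, 40]
edge_ms = [5, 5, 5, 5]
act_mbit = [8.0, 2.0, 2.0, 0.5]

good_link = Link(bandwidth_mbps=100.0, rtt_ms=10.0)
poor_link = Link(bandwidth_mbps=5.0, rtt_ms=50.0)

static_cut = best_split(device_ms, edge_ms, act_mbit, good_link)
adapted_cut = best_split(device_ms, edge_ms, act_mbit, poor_link)
stale = split_latency_ms(device_ms, edge_ms, act_mbit, static_cut, poor_link)
fresh = split_latency_ms(device_ms, edge_ms, act_mbit, adapted_cut, poor_link)
print(static_cut, adapted_cut, stale > fresh)
```

With these assumed numbers, the cut chosen for the good link is no longer the best one once the link degrades, which is the situation a split fixed before execution cannot react to.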

In addition, existing inference orchestration frameworks, such as Kubernetes, Ray Serve, InferLine, and KubeEdge, excel at container or micro-batch scheduling but treat large foundation models primarily as black boxes, and therefore lack mechanisms for runtime layer re-partitioning or privacy-aware placement. Consequently, the key problem of joint model-aware partitioning and placement under dynamic edge conditions remains open.

In this work, we address this problem by introducing an adaptive split inference orchestration framework that extends existing workload orchestrators with domain-specific capabilities tailored to large foundation models, such as multi-modal large language models. By leveraging the modular architecture of these models, we introduce the following capabilities:

1) Distribution of workloads to edge nodes with better performance or capacity than the original source node.

2) Relocation of split large foundation model segments to dynamically optimize resources under changing compute conditions.

3) Dynamic large foundation model re-splitting to further improve performance and resource utilization when required.
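One way to read the capabilities above is as an escalation: keep the current deployment while it meets its latency target, relocate existing segments if that is enough, and re-split only when relocation does not suffice. The sketch below encodes that reading. The escalation order, the `Action` names, and the latency estimates passed in are assumptions for illustration and are not taken from the paper.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"          # leave segments and placement untouched
    RELOCATE = "relocate"  # move existing segments to other nodes
    RESPLIT = "resplit"    # recompute segment boundaries, then place


def choose_action(current_ms, relocate_ms, resplit_ms, target_ms):
    """Pick the least invasive action whose estimated latency meets the target.

    All *_ms arguments are hypothetical latency estimates that a profiler
    would supply; target_ms is the quality-of-service latency bound.
    """
    if current_ms <= target_ms:
        return Action.KEEP
    if relocate_ms <= target_ms:
        return Action.RELOCATE
    if resplit_ms <= target_ms:
        return Action.RESPLIT
    # Nothing meets the target: fall back to the lowest estimate.
    options = [
        (current_ms, Action.KEEP),
        (relocate_ms, Action.RELOCATE),
        (resplit_ms, Action.RESPLIT),
    ]
    return min(options, key=lambda pair: pair[0])[1]


print(choose_action(current_ms=120, relocate_ms=90, resplit_ms=70, target_ms=100))
```

Preferring relocation over re-splitting here reflects the assumption that moving unchanged segments is cheaper than recomputing segment boundaries; a real orchestrator would also need to account for the cost of the migration itself.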

Unlike general-purpose workload schedulers, our framework operates on the computational graph of the large foundation model itself, allowing decisions at the granularity of, e.g., individual transformer blocks. This enables quality-of-service-driven re-splitting that commodity orchestrators do not address. Furthermore, because our solution emphasizes split inference, privacy comes as an additional feature at no extra cost when sensitive large foundation model layers can be executed locally, which makes reverse engineering data from model weights significantly more challenging for attackers. Through this approach, we establish a foundation for real-time and quality-of-service-aware large foundation model inference in edge networks, aligning with key 6G objectives of seamless connectivity, low inference latency, and intelligent edge resource management.
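Block-level placement with a locality constraint can be sketched as a small dynamic program over a chain of blocks. This is a simplified stand-in, not the optimization formulated in the paper: it assumes a purely sequential model, a fixed per-hop transfer cost, and hypothetical `trusted` and `sensitive` index sets, and it ignores the cost of delivering the input and returning the output.

```python
import math


def place_blocks(compute_ms, hop_ms, trusted, sensitive):
    """Assign each block of a sequential model to a node, minimizing latency.

    compute_ms[n][b]: estimated latency of block b on node n.
    hop_ms[m][n]: cost of moving activations from node m to node n.
    trusted: node indices allowed to run sensitive blocks.
    sensitive: block indices that must stay on trusted nodes.
    Returns (total_ms, assignment), or (inf, None) if no placement is feasible.
    """
    n_nodes, n_blocks = len(compute_ms), len(compute_ms[0])

    def allowed(node, block):
        return block not in sensitive or node in trusted

    # cost[n]: best latency for the blocks seen so far, ending on node n.
    cost = [compute_ms[n][0] if allowed(n, 0) else math.inf for n in range(n_nodes)]
    back = []  # back[b-1][n]: predecessor node chosen for block b on node n
    for b in range(1, n_blocks):
        new_cost, choice = [], []
        for n in range(n_nodes):
            if not allowed(n, b):
                new_cost.append(math.inf)
                choice.append(None)
                continue
            prev = min(range(n_nodes), key=lambda m: cost[m] + hop_ms[m][n])
            new_cost.append(cost[prev] + hop_ms[prev][n] + compute_ms[n][b])
            choice.append(prev)
        cost = new_cost
        back.append(choice)

    end = min(range(n_nodes), key=lambda n: cost[n])
    if math.isinf(cost[end]):
        return math.inf, None
    assignment = [end]
    for choice in reversed(back):  # walk predecessors back to block 0
        assignment.append(choice[assignment[-1]])
    assignment.reverse()
    return cost[end], assignment


# Illustrative numbers: node 0 is the user's device, node 1 an edge server.
compute_ms = [[30, 30, 30, 30], [5, 5, 5, 5]]
hop_ms = [[0, 20], [20, 0]]
print(place_blocks(compute_ms, hop_ms, trusted={0}, sensitive={0}))
print(place_blocks(compute_ms, hop_ms, trusted={0}, sensitive={0, 3}))
```

In this toy setting, marking only the first block as sensitive keeps it on the device and offloads the rest, while additionally marking the last block as sensitive forces the activations back to the device at the end, at a higher total latency.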

II. BACKGROUND

B. Distributed and Adaptive Split Inference

C. Key Design Goals of Adaptive Large Foundation Model Split Inference

III. PROPOSED ADAPTIVE ORCHESTRATION FRAMEWORK

A. Reference Architecture

B. Notation and System Model

C. Orchestration Workflow

Algorithm 1: Adaptive Split Orchestration Workflow

D. Privacy and Security Considerations

IV. EXPECTED PERFORMANCE RESULTS

V. CONCLUSIONS

Overview

The paper introduces a framework for real-time edge AI that enhances the orchestration of large foundation models through joint partitioning and placement. It focuses on optimizing inference performance by allowing for dynamic adaptations to resource availability and privacy requirements.

Key Points

1. An adaptive orchestration framework is essential for optimizing large foundation model inference in edge environments.
2. Static partitioning methods fail to accommodate the dynamic nature of edge computing.
3. The framework incorporates real-time profiling of resources to enhance performance and privacy.
4. Quality of service and service-level agreements are prioritized in the orchestration design.
5. The proposed solution allows for intelligent reallocation of model segments based on operational conditions.

Details

Authors: Aladin Djuhera, Fernando Koch, Alecio Binotto
Category: Technology and Engineering
