Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI
Abstract: Inference over large-scale foundation models in heterogeneous edge environments requires an orchestration substrate that can be reconfigured at runtime. Static partitioning of model layers presumes that compute and network resources remain stable over time, an assumption that real-world deployments routinely violate. We introduce a framework in which both the spatial placement and the internal segmentation of foundation models are resolved at runtime. The orchestration problem is formalized as a constrained optimization over layer-wise assignments, subject to evolving latency, utilization, and privacy constraints. By integrating model-aware capacity profiling with dynamic graph re-partitioning and reallocation, the framework recomposes the inference pipeline in response to infrastructure fluctuations. We present the architectural and algorithmic components, along with a representative use case in 6G multi-access edge computing.
I. INTRODUCTION
Next-generation 6G-enabled networks will need to support a multitude of AI services based on large foundation models (LFMs), such as transformer-based large language models. However, inference with these models requires significant compute, which makes their adoption challenging in edge environments. Distributed split inference has emerged as a promising approach to alleviate this computational burden. It partitions an LFM into multiple segments that are executed sequentially across different nodes; for example, some segments are executed locally while heavier ones are offloaded. However, such model splits are mostly static and fixed before execution, so they cannot adapt to dynamic and heterogeneous operating conditions such as fluctuating network reliability or changing node utilization. Consequently, these approaches lead to suboptimal performance, compromising latency, resource utilization, and quality of service guarantees, especially in mission-critical applications. The problem becomes even more acute in resource-constrained and heterogeneous edge environments, where multiple users share resources such as multi-access edge computing nodes, and where data privacy regulations often restrict offloading computations to the cloud. Such volatility renders any a priori static split untenable.
In addition, existing orchestration frameworks, such as Kubernetes, Ray Serve, InferLine, and KubeEdge, excel at container or micro-batch scheduling but treat large foundation models primarily as black boxes, and therefore lack mechanisms for runtime layer re-partitioning or privacy-aware placement. Consequently, the key problem of joint model-aware partitioning and placement under dynamic edge conditions remains open.
In this work, we address this problem by introducing an adaptive split inference orchestration framework that extends existing workload orchestrators with capabilities tailored to large foundation models, such as multi-modal large language models. By leveraging the modular architecture of these models, we introduce the following capabilities:
1) Distribution of workloads to edge nodes with better performance or capacity than the source node.
2) Relocation of split large foundation model segments to dynamically optimize resources under changing compute conditions.
3) Dynamic large foundation model re-splitting to further improve performance and resource utilization when required.
Unlike general-purpose workload schedulers, our framework operates on the computational graph of the large foundation model itself, allowing decisions at the granularity of, e.g., individual transformer blocks. This enables quality-of-service-driven re-splitting, which commodity orchestrators do not address. Furthermore, because our solution builds on split inference, privacy comes as an additional benefit at no extra cost when sensitive large foundation model layers can be executed locally, which makes it significantly harder for attackers to reverse engineer data from model weights. Through this approach, we establish a foundation for real-time and quality-of-service-aware large foundation model inference in edge networks, aligning with key 6G objectives of seamless connectivity, low inference latency, and intelligent edge resource management.
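To make block-level re-splitting concrete, the following is a minimal sketch and not the framework's actual algorithm. It assumes a simplified two-node pipeline (a local device and one edge node), a single cut point, and a purely additive latency model; `Node`, `split_latency`, `best_split`, and all numeric values are hypothetical names and toy figures introduced only for illustration. The privacy requirement is approximated by forcing the first `min_local_blocks` blocks to stay on the local device.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Node:
    name: str
    flops_per_s: float  # compute currently available on the node (assumed profile value)


def split_latency(k: int, block_flops: Sequence[float], act_bytes: Sequence[float],
                  local: Node, remote: Node, link_bytes_per_s: float) -> float:
    """Latency when blocks [0, k) run locally and blocks [k, n) run remotely.

    act_bytes[i] is the size of the input to block i (act_bytes[0] is the raw input).
    """
    n = len(block_flops)
    t_local = sum(block_flops[:k]) / local.flops_per_s
    t_remote = sum(block_flops[k:]) / remote.flops_per_s
    # Data crosses the link only if at least one block runs remotely.
    t_link = act_bytes[k] / link_bytes_per_s if k < n else 0.0
    return t_local + t_link + t_remote


def best_split(block_flops: Sequence[float], act_bytes: Sequence[float],
               local: Node, remote: Node, link_bytes_per_s: float,
               min_local_blocks: int = 0) -> int:
    """Return the cut index with the lowest latency that keeps the first
    min_local_blocks blocks on the local node (the assumed privacy rule)."""
    n = len(block_flops)
    if not 0 <= min_local_blocks <= n:
        raise ValueError("min_local_blocks must be between 0 and the number of blocks")
    return min(
        range(min_local_blocks, n + 1),
        key=lambda k: split_latency(k, block_flops, act_bytes, local, remote,
                                    link_bytes_per_s),
    )


# Toy profile: four equally expensive blocks, arbitrary units.
device = Node("device", flops_per_s=1.0)
mec = Node("mec", flops_per_s=4.0)
flops = [4.0, 4.0, 4.0, 4.0]
acts = [8.0, 2.0, 2.0, 2.0]

healthy_cut = best_split(flops, acts, device, mec, link_bytes_per_s=1.0, min_local_blocks=1)
degraded_cut = best_split(flops, acts, device, mec, link_bytes_per_s=0.125, min_local_blocks=1)
print(healthy_cut, degraded_cut)
```

With these toy numbers, the preferred cut sits after the first block while the link is healthy and moves to fully local execution once the link bandwidth drops, which is the kind of decision a static split cannot revise. A realistic orchestrator would additionally have to handle more than two nodes, segment relocation cost, and utilization targets, none of which are modeled here.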