One Chip, Many Cores: Firmware Orchestration Across CPU, DSP, and NPU in Heterogeneous SoCs


Modern edge SoCs pack more compute diversity onto a single die than most firmware teams know how to orchestrate. A chip like the Texas Instruments TDA4VM for automotive applications integrates dual Cortex-A72 application cores, a cluster of Cortex-R5F real-time cores, C7x DSP cores for signal processing, and an MMA matrix multiply accelerator for neural network inference — all on one piece of silicon. NXP's i.MX8M Plus adds a dedicated NPU to a Cortex-A53 cluster. Qualcomm's Hexagon processor combines scalar execution units, HVX vector extensions, and a tensor accelerator in a single programmable architecture. Rockchip's RK3588 adds a 6 TOPS NPU alongside its A76/A55 CPU complex.

The hardware integration is solved. The software orchestration problem is not. Each processor class on a heterogeneous SoC has its own instruction set, its own memory access model, its own toolchain, and its own execution model. Getting a workload that spans signal acquisition on a DSP, inference on an NPU, and result handling on a Linux application processor to execute correctly, efficiently, and without interfering with other concurrent workloads is a firmware architecture problem with no single standard answer. Teams that treat a heterogeneous SoC as a fast single-core processor and route everything through the main CPU leave the specialized accelerators largely idle, waste power, and deliver inferior latency. Teams that distribute workloads across the available processors without designing the inter-processor communication and lifecycle management carefully produce systems that are unreliable and nearly impossible to debug.

What Each Processor Class Brings to the System

Before discussing orchestration, it is worth being precise about what each processor type in a heterogeneous SoC is optimized for, because workload assignment errors — running the wrong task on the wrong core — are the most common source of inefficiency on these platforms.

The application CPU cluster (typically Cortex-A53/A72/A78 cores) runs a general-purpose OS, handles complex control flow, manages memory-mapped peripherals, runs network stacks, and executes application logic that benefits from the OS's services: file system access, TCP/IP, dynamic memory allocation, process isolation. The CPU is the worst choice for any workload that requires tight, regular timing — its caches, out-of-order execution, and OS scheduling introduce latency variability that defeats determinism.

The real-time CPU cluster (Cortex-R5, Cortex-M33, RISC-V RV32 cores) runs an RTOS or bare-metal firmware with deterministic scheduling, handles hard real-time I/O, runs safety-critical control loops, and responds to hardware interrupts within bounded latency. These cores avoid speculative and out-of-order execution, typically run from tightly coupled memory rather than relying on caches, and are often isolated from the main memory bus to protect their timing. They are the wrong choice for compute-intensive workloads, where their simpler pipelines and lower clock speeds are outperformed by an application core doing the same work.

The DSP (C64x, C7x, Hexagon vector) executes vectorized signal processing — FFT, FIR filtering, audio codec operations, radar signal chains — at high throughput with explicit SIMD operations and software-managed memory hierarchy. DSP programming requires manual vectorization and DMA management that is invisible to a C compiler targeting a general-purpose CPU. DSPs excel at throughput-bound, data-parallel workloads with predictable memory access patterns. They are poorly suited for workloads with complex control flow, dynamic memory allocation, or irregular access patterns.

The NPU executes neural network inference against a fixed computational graph: matrix multiplications, convolutions, activation functions, pooling — the operations that constitute deep learning models. NPU efficiency depends on loading the model weights into on-chip scratchpad, streaming input tensors through the compute array, and reading output tensors with minimal DRAM access. NPUs achieve their performance advantage only on workloads that match this structure. They are not general-purpose accelerators; an NPU asked to run a task that does not map to its MAC array architecture will deliver poor utilization and potentially worse performance than the CPU.

Correct workload assignment is the first orchestration decision. On a TDA4VM processing a radar signal chain, the ADC data arrives on a dedicated capture peripheral, the DSP performs the range-FFT and Doppler-FFT, the MMA/NPU runs object detection inference on the processed radar cube, and the Cortex-A handles tracking, system state management, and communication. Each stage runs on the core it is best suited for. The orchestration layer exists to connect these stages efficiently.

Asymmetric Multiprocessing — The Software Model for Heterogeneous Cores

The foundational software model for heterogeneous SoCs is Asymmetric Multiprocessing (AMP). Unlike Symmetric Multiprocessing (SMP), where multiple identical cores run a single OS instance and share a unified view of memory and scheduling, AMP runs independent software stacks on different processor types — Linux on the Cortex-A cluster, FreeRTOS or Zephyr on the Cortex-R5, and a vendor-specific RTOS on the DSP — each with its own execution environment, and with inter-processor communication managed explicitly through shared memory and hardware mailboxes.

The two infrastructural pieces that make AMP manageable in Linux-based heterogeneous systems are remoteproc and RPMsg, both part of the Linux kernel's standard driver stack.

Remoteproc handles the lifecycle management of the co-processors from the perspective of the Linux host processor. It loads the firmware ELF binary to the appropriate memory region, releases the co-processor from reset, monitors its state, and shuts it down when its services are no longer needed. The remote firmware binary contains a resource table — a structured section specifying the memory regions it needs, the inter-core notification mechanisms it uses, and the virtual devices it supports. Remoteproc parses this resource table, allocates the required physically contiguous memory, configures the IOMMU, and establishes the conditions the co-processor firmware declared it needs before starting it. The sysfs interface (/sys/class/remoteproc/remoteproc0/state, /sys/class/remoteproc/remoteproc0/firmware) exposes this lifecycle management to userspace or system service scripts.
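
A minimal sketch of this lifecycle control from userspace, assuming a platform where the co-processor appears as remoteproc0; the firmware name is a placeholder:

```c
/* Start a co-processor through the remoteproc sysfs interface.
 * Paths follow the kernel's documented layout; "dsp-signal-chain.elf"
 * is a hypothetical firmware image placed in /lib/firmware. */
#include <stdio.h>
#include <stdlib.h>

static int sysfs_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fputs(value, f) >= 0);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    /* Select which firmware image remoteproc loads from /lib/firmware. */
    if (sysfs_write("/sys/class/remoteproc/remoteproc0/firmware",
                    "dsp-signal-chain.elf") < 0)
        return EXIT_FAILURE;

    /* Parse the resource table, load the ELF, release from reset. */
    if (sysfs_write("/sys/class/remoteproc/remoteproc0/state", "start") < 0)
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}
```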

RPMsg provides the message-passing protocol that runs over the shared memory regions established by remoteproc. It uses VirtIO rings — vring data structures — as the transport abstraction: one vring carries messages from the host to the remote, one carries messages from the remote to the host. Hardware mailbox interrupts notify each side when new messages are available in the vring. From the application perspective, RPMsg appears as named channels and endpoints: the remote firmware announces a service name, the Linux RPMsg bus creates a corresponding device, and a driver or userspace application communicates with the remote service through send and receive operations on the channel.
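
From userspace, the kernel's rpmsg_char driver exposes these endpoints as character devices. A hedged sketch, assuming the remote firmware has announced a service; the channel name "dsp-control" and the /dev/rpmsg0 numbering are placeholders for whatever the kernel actually assigns:

```c
/* Create an RPMsg endpoint from Linux userspace and exchange one message. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/rpmsg.h>

#ifndef RPMSG_ADDR_ANY
#define RPMSG_ADDR_ANY 0xFFFFFFFF   /* older uapi headers omit this */
#endif

int main(void)
{
    int ctrl = open("/dev/rpmsg_ctrl0", O_RDWR);
    if (ctrl < 0)
        return 1;

    /* Ask the kernel to create a local endpoint bound to the channel;
     * this materializes a /dev/rpmsgN node for the endpoint. */
    struct rpmsg_endpoint_info info = { .src = RPMSG_ADDR_ANY,
                                        .dst = RPMSG_ADDR_ANY };
    strncpy(info.name, "dsp-control", sizeof(info.name) - 1);
    if (ioctl(ctrl, RPMSG_CREATE_EPT_IOCTL, &info) < 0)
        return 1;

    int ept = open("/dev/rpmsg0", O_RDWR);
    if (ept < 0)
        return 1;

    char reply[496];                  /* one default vring buffer payload */
    write(ept, "ping", 4);            /* send to the remote core          */
    ssize_t n = read(ept, reply, sizeof(reply)); /* block for the answer  */

    close(ept);
    close(ctrl);
    return n > 0 ? 0 : 1;
}
```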

RPMsg messages have a default buffer limit of 512 bytes, which is sufficient for control messages and small data payloads but not for bulk sensor data or image frames. When large data transfers are needed — say, handing a DMA-captured processing result from the DSP to the Linux application layer — the pattern is to pass a physical address and size through the RPMsg control channel rather than copying the data through the ring buffers. The Linux side maps the indicated physical region, reads the result, and signals completion through a return message. This zero-copy pattern keeps the inter-processor messaging overhead proportional to the control message size rather than the data size.
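
The control messages in this pattern are small, fixed-layout structures agreed between both sides. An illustrative (hypothetical) descriptor layout:

```c
/* Hypothetical message layout for the zero-copy handoff: the RPMsg
 * channel carries only this small descriptor, never the bulk payload.
 * A real design adds sequence numbers and an agreed cache-maintenance
 * protocol on both sides. */
#include <stdint.h>

struct buf_descriptor {
    uint64_t phys_addr;   /* physical address of the shared data buffer  */
    uint32_t length;      /* valid bytes written by the producing core   */
    uint32_t buf_id;      /* token echoed back when the consumer is done */
} __attribute__((packed));

struct buf_release {      /* completion message sent back by Linux       */
    uint32_t buf_id;      /* which buffer the remote may now reuse       */
    uint32_t status;      /* 0 = consumed OK, nonzero = error code       */
} __attribute__((packed));
```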

OpenAMP extends this infrastructure to the co-processor side. The OpenAMP library provides RTOS-agnostic remoteproc and RPMsg implementations that run on FreeRTOS, Zephyr, or bare-metal firmware, presenting APIs consistent with the Linux kernel's RPMsg semantics. A Zephyr application on an STM32 Cortex-M4 co-processor and a FreeRTOS application on a TI R5F both use OpenAMP to communicate with the Linux host through the same vring-based transport, with the hardware-specific interrupt signaling encapsulated in the libmetal hardware abstraction layer.
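
On the co-processor side, the matching endpoint might look like the following sketch against OpenAMP's RPMsg API; the rpmsg_device handle comes from platform-specific BSP/libmetal initialization that is elided here, and "dsp-control" matches the hypothetical channel name used above:

```c
/* Co-processor-side RPMsg endpoint using OpenAMP (FreeRTOS, Zephyr,
 * or bare-metal). */
#include <openamp/open_amp.h>

static struct rpmsg_endpoint ept;

/* Runs in the context that services the mailbox interrupt when the host
 * places a message in the host-to-remote vring. */
static int ept_cb(struct rpmsg_endpoint *e, void *data, size_t len,
                  uint32_t src, void *priv)
{
    (void)src; (void)priv;
    /* Echo the payload back; a real service would dispatch on a command
     * word and reply with results or a zero-copy buffer descriptor. */
    rpmsg_send(e, data, len);
    return RPMSG_SUCCESS;
}

/* Called once after the BSP has brought up the rpmsg_device. Announcing
 * the endpoint triggers the name-service message that makes the channel
 * appear on the Linux RPMsg bus. */
void app_start(struct rpmsg_device *rdev)
{
    rpmsg_create_ept(&ept, rdev, "dsp-control",
                     RPMSG_ADDR_ANY, RPMSG_ADDR_ANY,
                     ept_cb, NULL);
}
```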


NPU Workload Submission — A Different Execution Model

The NPU presents a fundamentally different orchestration challenge from the CPU-to-CPU or CPU-to-DSP case. The DSP and real-time CPU cores run firmware that responds to messages and executes operations — they are programmable processors with their own execution contexts. Most embedded NPUs are not programmable processors in this sense. They are DMA-driven accelerators: the host CPU submits a compiled computation graph (a model binary generated by the vendor's compiler toolchain), an input buffer, and an output buffer, and the NPU hardware executes the graph against the input and signals completion through an interrupt or by writing to a status register. There is no NPU-side firmware in the conventional sense.

This execution model means that NPU orchestration is managed entirely from the host CPU side. The host CPU's role is to prepare the input tensor buffer — from a camera capture, a sensor DMA transfer, or the output of a DSP processing stage — place it at the physical address the NPU DMA engine will read from, submit the inference job to the NPU driver, and wait for the completion signal. The NPU driver abstracts the hardware-specific submission interface: on NXP i.MX8M Plus, the VeriSilicon-based NPU is driven through the galcore kernel driver and NXP's eIQ runtime stack; on Rockchip RK3588, the NPU is accessed through the RKNPU kernel driver and the RKNN runtime; on TI TDA4VM, the MMA/TIDL framework handles graph compilation and runtime submission.
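
Because each vendor runtime wraps this same flow in its own API, a vendor-neutral sketch conveys the shape better than any one SDK. Every function below is hypothetical shorthand, not a real driver interface:

```c
/* Host-side NPU submission flow, sketched with placeholder functions
 * standing in for what a real runtime (RKNN, TIDL, eIQ, QNN) provides. */
#include <stddef.h>
#include <stdint.h>

typedef struct npu_ctx npu_ctx;                 /* opaque runtime handle   */

npu_ctx *npu_load_model(const char *path);      /* compiled graph binary   */
int npu_set_input(npu_ctx *c, const void *tensor, size_t len);
int npu_submit(npu_ctx *c);                     /* kick the DMA engine     */
int npu_wait(npu_ctx *c, uint32_t timeout_ms);  /* completion interrupt    */
int npu_get_output(npu_ctx *c, void *tensor, size_t len);

int run_inference(const void *in, size_t in_len, void *out, size_t out_len)
{
    static npu_ctx *ctx;                        /* load the model once     */
    if (!ctx)
        ctx = npu_load_model("/lib/firmware/detector.bin"); /* placeholder */
    if (!ctx)
        return -1;

    /* One serialized job: fill the input buffer the NPU DMA engine reads,
     * submit, block until the completion signal, read the output tensor. */
    if (npu_set_input(ctx, in, in_len) != 0)  return -1;
    if (npu_submit(ctx) != 0)                 return -1;
    if (npu_wait(ctx, 100) != 0)              return -1;
    return npu_get_output(ctx, out, out_len);
}
```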

The compilation step that produces the NPU-executable model binary is a one-time pre-deployment activity, not a runtime operation. The vendor compiler (NXP's eIQ toolkit, TI's TIDL import tool, Rockchip's RKNN-Toolkit, Qualcomm's QNN SDK) takes a standard model format — ONNX, TFLite, PyTorch — and converts it to a hardware-specific binary that encodes the computation graph, quantization parameters, and memory layout for the target NPU. The compiled binary is stored in the filesystem or embedded in the firmware image and loaded by the runtime at application startup.

The performance characteristic that matters most for orchestration is that NPU inference has a fixed, predictable execution time for a given model and input size — a property that neither the CPU nor the DSP consistently provides for equivalent workloads. A quantized MobileNetV2 on a 2 TOPS NPU takes approximately the same time on every invocation, with variance measured in microseconds. This predictability makes pipeline scheduling straightforward: if the NPU takes 10 milliseconds per inference at a 30 fps input rate, NPU utilization is 30 percent, leaving 70 percent of the NPU's time available for additional inference jobs scheduled through the single serialized submission queue that most edge NPUs provide.
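
As a quick check on that arithmetic, under the stated 10 ms per inference and 30 fps assumptions:

$$ t_{\text{frame}} = \frac{1}{30\ \text{fps}} \approx 33.3\ \text{ms}, \qquad u_{\text{NPU}} = \frac{t_{\text{inference}}}{t_{\text{frame}}} = \frac{10\ \text{ms}}{33.3\ \text{ms}} \approx 30\% $$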

Shared Memory and Bandwidth Contention

All processor cores on a heterogeneous SoC share the DRAM interface. The NPU streaming its weight data from external memory, the DSP running DMA transfers for its audio processing pipeline, and the Cortex-A Linux kernel paging in application code are all simultaneously competing for DRAM bandwidth through the shared memory controller. This contention is the primary performance degradation mechanism that heterogeneous SoC firmware orchestration must manage, and it is the one most commonly neglected during initial system bring-up when only one subsystem is running at a time.

The practical manifestations of unmanaged bandwidth contention are: NPU inference taking significantly longer than the standalone benchmark because the DRAM is simultaneously serving DSP DMA transfers; real-time CPU tasks missing their deadlines because the memory controller is busy with bursts from a video capture DMA; and Linux application latency spiking when the NPU starts a large weight-fetch sequence.

Several architectural strategies mitigate contention. Scratchpad memory pre-loading — transferring the NPU model weights from DRAM to the NPU's internal scratchpad at startup, rather than streaming them from DRAM on every inference — eliminates the weight-fetch bandwidth demand during inference, leaving DRAM bandwidth available for input tensor transfer only. Temporal scheduling of high-bandwidth operations — ensuring that the DSP's large DMA burst and the NPU's inference job do not execute concurrently — reduces peak contention at the cost of reduced parallelism. Hardware-level bandwidth throttling, available on some SoCs through memory controller QoS registers, reserves a minimum DRAM bandwidth allocation for real-time-critical processors and limits the maximum bandwidth a non-critical processor can consume.
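
Temporal scheduling can be as simple as a host-side semaphore that prevents the two highest-bandwidth operations from overlapping. A minimal sketch under that assumption, with the actual DSP and NPU submission calls left as hypothetical stubs:

```c
/* Serialize the DRAM-heavy phases of two pipelines so their bursts
 * never contend. Each function below is a thread body; creation and
 * the real submission calls are left out. */
#include <semaphore.h>

static sem_t dram_heavy_slot;          /* one high-bandwidth job at a time */

void *dsp_pipeline_thread(void *arg)
{
    for (;;) {
        sem_wait(&dram_heavy_slot);    /* claim the bandwidth-heavy slot */
        /* dsp_burst_transfer(); */    /* hypothetical large DMA burst   */
        sem_post(&dram_heavy_slot);
        /* ... per-frame bookkeeping that touches little DRAM ... */
    }
    return arg;
}

void *npu_pipeline_thread(void *arg)
{
    for (;;) {
        sem_wait(&dram_heavy_slot);
        /* npu_run_inference(); */     /* hypothetical: streams tensors  */
        sem_post(&dram_heavy_slot);
    }
    return arg;
}

int init_scheduler(void)
{
    return sem_init(&dram_heavy_slot, 0, 1);   /* binary: one slot */
}
```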

The TI J7ES and TDA4VM SoC families implement hardware firewalls and bandwidth limiters that can be configured to provide guaranteed bandwidth to the R5F real-time cores while bounding the bandwidth available to the DSP and accelerators. These configurations are set up during the boot sequence and persist for the operating lifetime. NXP's i.MX8M family provides NoC QoS registers for similar purposes. Using these mechanisms requires understanding the bandwidth demand of each concurrent workload — knowledge that must come from profiling representative workloads on the target hardware, not from simulation.


Boot Sequence and Firmware Lifecycle Design

The boot sequence of a heterogeneous SoC system determines the availability ordering of each subsystem, and its design has significant implications for system startup time, safety behavior, and firmware update handling.

The typical Linux-AMP boot sequence starts with the bootloader (U-Boot), which initializes the hardware, loads the Linux kernel and devicetree, and optionally pre-loads co-processor firmware images to their designated memory regions before releasing the Cortex-A from reset. This pre-load approach allows the real-time co-processors to start executing before Linux completes its boot sequence, which matters when the real-time core must begin processing sensor data or maintaining safety interlocks that cannot wait for the full Linux boot. U-Boot remoteproc commands provide the mechanism for this on supported platforms.

After Linux boots, remoteproc can load and start co-processor firmware from the Linux side, triggered either by a systemd service or by the remoteproc udev rules that automatically start firmware when the corresponding remoteproc device appears. This post-boot loading approach is simpler to manage for firmware updates — the co-processor firmware binary is updated in the filesystem through the normal OTA update mechanism, and the updated firmware loads on the next co-processor restart — but adds latency between Linux boot completion and co-processor availability.

Firmware lifecycle for co-processors in a field-deployed product requires specific handling that the standard remoteproc documentation assumes rather than specifies. The co-processor firmware binary must be versioned and authenticated separately from the Linux root filesystem update, because it may need to be updated independently when a DSP signal processing algorithm changes or an NPU model is refined. The remoteproc sysfs interface allows loading a new firmware image to a running system without rebooting — stopping the co-processor, replacing the firmware binary, and restarting — which is the field update mechanism for co-processor firmware when full system reboots are disruptive.
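
In code, that live-update sequence is three sysfs writes, sketched below under the assumption that the OTA agent has already placed the new binary in /lib/firmware; the image name is a placeholder:

```c
/* Stop / replace / restart a co-processor firmware image through
 * remoteproc sysfs, without rebooting Linux. */
#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fputs(val, f) >= 0);
    fclose(f);
    return ok ? 0 : -1;
}

int update_coprocessor_firmware(void)
{
    const char *state = "/sys/class/remoteproc/remoteproc0/state";
    const char *fw    = "/sys/class/remoteproc/remoteproc0/firmware";

    if (sysfs_write(state, "stop"))       /* halt the co-processor      */
        return -1;
    if (sysfs_write(fw, "dsp-v2.elf"))    /* hypothetical new image     */
        return -1;
    return sysfs_write(state, "start");   /* reload the ELF and run     */
}
```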

Crash recovery is a specific lifecycle requirement that the firmware architecture must address. If the DSP firmware crashes due to a bug — a common occurrence during development and an occasional occurrence in production — remoteproc can detect the crash through a watchdog or through the co-processor entering a fault state, collect a crash dump for debugging, and restart the co-processor with the existing firmware binary. This recovery behavior must be designed and tested explicitly; a co-processor crash that hangs the system or corrupts shared memory regions without recovery is a product reliability problem.

Toolchain and SDK Fragmentation

The practical difficulty of heterogeneous SoC firmware development that no architecture diagram fully conveys is toolchain fragmentation. Each processor on the SoC requires a different toolchain, produces a different binary format, and is debugged with different tools.

The Cortex-A Linux application is compiled with GCC or Clang targeting arm64-linux-gnu and debugged with GDB over SSH or through gdbserver. The Cortex-R5 RTOS firmware is compiled with an arm-none-eabi cross-compiler and debugged through JTAG using a J-Link or XDS debug probe with a vendor-specific IDE or OpenOCD. The DSP firmware is compiled with the vendor's DSP compiler (TI's C7000 code generation tools for C7x, Cadence's Xtensa tools for HiFi DSPs) and requires the vendor's DSP profiler and simulator for performance optimization. The NPU model is compiled off-device with the vendor's ML compiler toolkit on a development workstation and tested with the vendor's simulation environment before deployment.

Each of these toolchains has its own build system, its own dependency management, and its own debug workflow. The firmware team typically needs engineers with competency in at least two of these toolchain domains, and the integration testing that validates the full multi-processor system requires all of them to be running simultaneously on target hardware.

SoC vendors have invested in unified SDK environments to reduce this fragmentation: TI's Processor SDK RTOS integrates toolchains for all processor types on the J7ES family with a unified build system; NXP's MCUXpresso SDK covers the M-core co-processors on i.MX8 with consistent tooling; Qualcomm's AI Stack provides unified abstractions for the Hexagon DSP and NPU alongside the CPU application layer. These SDKs reduce but do not eliminate the toolchain diversity — the Linux application developer and the DSP firmware developer still work in different environments — and their maturity and documentation quality vary significantly across vendors.

Quick Overview

Heterogeneous SoC firmware orchestration assigns workloads to the processor type best suited for each class of computation — application CPU for OS-dependent control logic, real-time CPU for deterministic I/O and safety interlocks, DSP for vectorized signal processing, NPU for neural network inference — and manages the inter-processor communication and shared resource coordination between them. The Linux kernel's remoteproc and RPMsg frameworks, extended to co-processor RTOS environments through OpenAMP, provide the standard infrastructure for firmware lifecycle management and message-passing in AMP configurations. DRAM bandwidth contention is the primary runtime performance hazard requiring explicit management through scratchpad pre-loading, temporal scheduling, and hardware QoS configuration. Toolchain fragmentation across processor types remains the primary engineering challenge, partially addressed by vendor SDK environments.

Key Applications

Automotive ADAS and radar processing platforms, where real-time DSP signal chains and NPU object detection must run concurrently with Linux-hosted tracking and communication software; industrial vision inspection systems combining high-speed image capture on a real-time core, feature extraction on a DSP, and classification inference on an NPU; smart camera and video analytics platforms on RK3588 or i.MX8M Plus, where the ISP, NPU, and application processor operate on the same sensor data in pipelined fashion; medical device platforms combining RTOS-based physiological signal processing with AI-assisted diagnostic inference; and grid-edge monitoring systems where deterministic signal measurement runs alongside on-device ML analytics.

Benefits

Concurrent execution of workloads on their optimal processor types reduces total system latency compared to routing everything through the application CPU. Power efficiency improves substantially because specialized accelerators execute their target workloads at a fraction of the energy per operation compared to a general-purpose CPU. NPU inference predictability — consistent execution time regardless of system load — enables tight pipeline scheduling with defined latency guarantees. Architectural separation between real-time co-processor functions and Linux application logic provides fault isolation: a Linux crash does not affect the real-time core, and a co-processor firmware crash can be recovered through remoteproc restart without a full system reboot.

Challenges

Toolchain fragmentation across processor types requires engineering teams with multiple specialized skill domains and significantly complicates the build, test, and debug infrastructure. DRAM bandwidth contention between concurrent subsystems is invisible during per-subsystem bring-up and only manifests during integrated system testing. Vendor SDK quality and documentation for co-processor toolchains often lags behind the application processor toolchain maturity, and BSP support for the remoteproc/RPMsg infrastructure on newer SoC families is not always complete in the mainline kernel. NPU model compilation toolchains are vendor-specific and tied to hardware generations, creating porting effort when migrating to a new SoC family.

Outlook

The integration of dedicated NPU accelerators into edge SoCs has moved from differentiation to commodity: by 2025 it is standard across NXP, TI, Rockchip, Qualcomm, and emerging RISC-V SoC families. The software orchestration tooling is following with a lag. IREE (Intermediate Representation Execution Environment), an open-source compiler and runtime that originated at Google and has been adopted by Synaptics and others for IoT edge deployment, targets the toolchain fragmentation problem by generating optimized code for CPU, GPU, and NPU backends from a single MLIR representation. OpenAMP continues to expand its scope beyond RPMsg and remoteproc toward power management and non-CPU device lifecycle management. As heterogeneous SoC designs add more specialized accelerators per chip generation, the orchestration software layer that routes workloads, manages bandwidth, and coordinates lifecycle becomes the defining engineering challenge of embedded system software.

Related Terms

AMP, asymmetric multiprocessing, SMP, remoteproc, RPMsg, OpenAMP, virtio, vring, libmetal, DSP, NPU, Cortex-R5, Cortex-A72, C7x DSP, MMA, TIDL, RKNPU, eIQ, QNN, RKNN, TFLite, ONNX, MLIR, IREE, NoC QoS, DRAM bandwidth, scratchpad, DMA, hardware mailbox, IPCC, TDA4VM, i.MX8M Plus, RK3588, PolarFire SoC, J7ES, INT8 quantization, inference latency, resource table, lifecycle management, firmware update co-processor, FreeRTOS, Zephyr, TI-RTOS, RTOS bare-metal, zero-copy transfer, memory contention, hardware firewall

FAQ

What is the difference between remoteproc and RPMsg in heterogeneous SoC firmware?

 

Remoteproc handles the lifecycle management of co-processors from the Linux host: it loads firmware binaries, configures memory and IOMMU, releases the co-processor from reset, and shuts it down. RPMsg is the message-passing protocol that runs over the shared memory regions remoteproc establishes — it provides named channels and endpoints through which Linux drivers or userspace applications and co-processor firmware exchange messages at runtime. Remoteproc is about starting and stopping processors; RPMsg is about communication while they are running.

 

How is NPU inference submission different from submitting work to a DSP?

 

A DSP is a programmable processor that runs its own firmware, responds to messages, and executes arbitrary code in its own context. Submitting work to a DSP means sending a message through RPMsg requesting a specific computation. An embedded NPU is typically a DMA-driven accelerator, not a firmware-running processor: the host CPU prepares input buffers, submits a compiled computation graph through a hardware register interface, and waits for a completion interrupt. There is no NPU-side firmware responding to messages; the NPU's behavior is entirely defined by the pre-compiled graph binary and the hardware's fixed execution engine.

 

Why does DRAM bandwidth contention matter for heterogeneous SoC firmware performance?

 

All processor cores on a heterogeneous SoC share the DRAM interface. When the NPU streams model weights from external memory, the DSP runs DMA transfers for its signal processing pipeline, and the application CPU fetches code and data simultaneously, they compete for limited DRAM bandwidth through the shared memory controller. This contention degrades performance in ways that are invisible when each subsystem is benchmarked independently. Common consequences include NPU inference taking two to three times longer than standalone benchmarks, and real-time CPU tasks missing deadlines during peak DRAM load. Mitigation requires NPU weight pre-loading to on-chip scratchpad, temporal scheduling of high-bandwidth operations, and hardware QoS configuration where the SoC supports it.

 

What is OpenAMP and when is it needed in a heterogeneous SoC design?

 

OpenAMP is an open-source framework that provides remoteproc and RPMsg implementations for RTOS and bare-metal environments, mirroring the semantics of the Linux kernel's corresponding infrastructure. It is needed on the co-processor side when the co-processor runs FreeRTOS, Zephyr, or bare-metal firmware and needs to communicate with Linux using the same vring-based RPMsg protocol that the Linux kernel uses. Without OpenAMP, the co-processor firmware must implement its own inter-processor communication protocol that may not be compatible with the Linux remoteproc/RPMsg infrastructure. OpenAMP's libmetal hardware abstraction layer handles the platform-specific interrupt and memory details, making the same OpenAMP application code portable across different SoC families.