From Monolithic Model to Composable Pipeline: Running Mixed AI Workloads on Embedded NPU


Most embedded AI deployments follow the same pattern: one model, one task, one compiled blob for the NPU. A YOLOv8-nano for detection. A MobileNetV3 for classification. They work independently, communicate through CPU memory copies, and get deployed sequentially. This architecture is adequate for a single-task product. It breaks down when the application needs qualitatively different perception capabilities running concurrently — spatial scene understanding via attention, pixel-level anomaly detection via classical CV, and always-on event detection via spiking networks — all on the same SoC with a shared power budget and a single NPU.

The alternative is a composable pipeline: a unified computation graph that spans classical preprocessing, convolutional or transformer-based feature extraction, and event-driven sparse computation, scheduled on the NPU as an integrated workload rather than as independent blobs orchestrated by the CPU. Doing this correctly requires understanding what each compute paradigm needs from the NPU, where they interfere with each other, and how the compiler exposes enough of the hardware to optimize across their boundaries.

Three Compute Paradigms, Three Resource Profiles

Before composing anything, it is worth being precise about what classical CV, vision transformers, and spiking networks actually demand from embedded hardware, because they differ in almost every dimension.

Classical computer vision operations — Gaussian blur, Sobel gradients, connected component labeling, optical flow — are spatially local and arithmetically light. Their FLOP counts are low but they require full image buffer access and produce intermediate results that neural stages consume downstream. On NPUs with programmable vector cores alongside the MAC array (the NXP i.MX 8M Plus, Rockchip RK3588, and similar SoCs pair their NPU with a dedicated ISP or DSP), classical CV can run on-chip without CPU round-trips. On NPUs that expose only tensor operations, classical preprocessing must be approximated as convolutional layers or executed on the CPU, breaking the continuous compute graph.

Vision transformers — MobileViT-S, TinyViT-5M, EfficientFormer-V2, Swin-T — are MAC-array workloads with one structural complication: the attention mechanism. Multi-head self-attention produces outputs through a softmax over dot-product similarity scores, and softmax is not naturally supported by integer MAC arrays. Most commercial NPUs implement softmax, GELU, and LayerNorm in a secondary programmable unit or fall back to the host CPU, creating a pipeline stall at every attention layer. The FQ-ViT quantization approach resolves this by replacing softmax with a log-int-softmax function computed entirely with bit-shift operations in integer arithmetic, enabling full INT8 inference through attention without any CPU fallback. This is a prerequisite for running a ViT block as a continuous NPU subgraph rather than a split CPU-NPU execution.
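The bit-shift idea can be illustrated with a short sketch. This is a simplified power-of-two softmax in the spirit of FQ-ViT's log-int-softmax, not the paper's exact formulation; it assumes the attention scores are already quantized to integers.

```python
def log2_int_softmax(scores, out_bits=8):
    """Softmax approximation using only integer shifts.

    Replaces exp(x) with 2**x so each unnormalized weight becomes a
    right-shift of a fixed-point constant by the score's distance from
    the row maximum. A simplified sketch in the spirit of FQ-ViT's
    log-int-softmax, not the paper's exact formulation; assumes the
    attention scores are already quantized to integers.
    """
    x_max = max(scores)
    scale = 1 << out_bits                 # fixed-point representation of 1.0
    # 2**(x - x_max) == scale >> (x_max - x) in fixed point.
    weights = [scale >> (x_max - x) if x_max - x < out_bits else 0
               for x in scores]
    total = sum(weights)                  # never zero: the max contributes scale
    return [w * scale // total for w in weights]
```

Because every operation is a shift, add, or integer divide, the whole attention row stays on the integer MAC path with no dequantization to floating point.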

Spiking neural networks operate on a fundamentally different resource profile: instead of computing dense activation tensors at every forward pass, they produce sparse binary spike trains where most neurons are silent at any given timestep. Power consumed is proportional to spike density, not to model size — a sparse input with little motion generates far fewer spikes than a dense one, making energy consumption input-adaptive. The BrainChip Akida v1.0, built on TSMC 28nm, implements 80 neuromorphic processing units with 1.2 million virtual neurons, consuming between a few hundred microwatts and a few hundred milliwatts depending on spike density. Intel's Loihi 2, introduced in 2021, demonstrates up to 100× lower energy per inference than conventional NPUs on sparse workloads. BrainChip's Akida Pico IP block operates below 1 mW in standby — enabling always-on wake-up detection that would exhaust any conventional NPU running continuously.

The following table maps these three paradigms to their hardware resource demands:

Paradigm                  Primary compute             Memory pattern            Sparsity         Power profile
Classical CV              Vectorized integer ops      Streaming image buffer    Dense            Low, fixed
Vision transformer (ViT)  MAC array + attention       Random access, KV cache   Dense            High, fixed per frame
Spiking network (SNN)     Neuron-synapse accumulate   Sparse event addresses    Input-adaptive   Milliwatt, input-proportional

Why Sequential Chaining Is Not Enough

Connecting models through CPU memory copies is the naive pipeline approach and its costs are predictable. Each CPU handoff adds DMA transfer latency, tensor format conversion, and memory allocation overhead — typically 3–8 ms per transition on Cortex-A class processors. A three-model pipeline with two handoffs adds 6–16 ms before the first model even starts processing the next frame, which on a 30 FPS input stream leaves little margin for the models themselves.

The deeper cost is missed optimization. NPU compilers achieve substantial speedup by fusing sequences of operations — Conv-BN-ReLU into a single compiled kernel, depthwise followed by pointwise convolution as a single tensor program, QKV projection fused with scaled dot-product attention. These fusions are visible to the compiler when the full pipeline is expressed as a unified operator graph. They are invisible when the pipeline is split into independently compiled blobs that the CPU connects at runtime.
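Conv-BN fusion, the first example above, is just algebra over the layer parameters, which is why a compiler can apply it whenever both ops appear in one compiled graph. A minimal from-scratch sketch with per-output-channel scalars, not any vendor's actual API:

```python
import math

def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    The BN affine transform is absorbed into the conv's weights and
    bias, so a single kernel runs at inference time. `weight` is a
    list of per-output-channel weight lists; the BN parameters are
    per-channel scalars. A sketch of the algebra, not a specific
    compiler's implementation.
    """
    fused_w, fused_b = [], []
    for c in range(len(weight)):
        s = gamma[c] / math.sqrt(var[c] + eps)      # BN scale for channel c
        fused_w.append([w * s for w in weight[c]])  # scale the kernel
        fused_b.append((bias[c] - mean[c]) * s + beta[c])
    return fused_w, fused_b
```

The fused layer is numerically identical to running conv then BN, which is exactly why the fusion is free: it removes a kernel launch and an intermediate tensor without changing the output.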

A third cost is redundant computation. A classical edge detector and a convolutional feature extractor both compute spatial gradients in their respective early stages. When they are separate programs these gradients are computed twice. A unified graph allows the compiler to compute the shared prefix once and route its output to both downstream branches, recovering latency and DRAM bandwidth.
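The shared-prefix optimization is ordinary common-subexpression elimination applied across branch boundaries. A toy sketch over a flat, topologically ordered operator list — node and op names here are illustrative:

```python
def dedup_shared_prefixes(nodes):
    """Common-subexpression elimination over a flat operator list.

    Each node is (name, op, input_names). Two nodes computing the same
    op on the same inputs (e.g. a Sobel gradient needed by both a
    classical edge branch and a CNN stem) are merged, and downstream
    references are rewritten to the surviving node. A toy sketch of
    what a unified-graph compiler does across branch boundaries.
    """
    seen = {}      # (op, inputs) -> canonical node name
    alias = {}     # removed name -> surviving name
    out = []
    for name, op, inputs in nodes:
        inputs = tuple(alias.get(i, i) for i in inputs)
        key = (op, inputs)
        if key in seen:
            alias[name] = seen[key]      # duplicate: keep only the alias
        else:
            seen[key] = name
            out.append((name, op, inputs))
    return out
```

When the two branches live in separately compiled blobs, the duplicate node is invisible and this pass simply cannot run.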

The engineering alternative is expressing the complete pipeline in a single intermediate representation — ONNX, TFLite FlatBuffer, or a vendor-specific graph IR — with classical CV operations expressed as equivalent tensor operations or custom registered ops, ViT layers expressed with integer-compatible attention approximations, and the SNN front-end expressed as a preprocessing module that converts dense frame differences into spike events before the neuromorphic stage. This representation then passes through the NPU compiler as a single compilation unit.
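As one example of re-expressing classical CV as tensor operations, a Sobel gradient is just a 3×3 convolution with fixed weights, so it can ship as an ordinary Conv node in the exported graph. A pure-Python sketch with 'valid' padding:

```python
# Sobel horizontal-gradient kernel as fixed convolution weights.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

def conv2d_3x3(image, kernel):
    """Express a classical CV operator as a plain 3x3 convolution.

    Written this way, a Sobel gradient becomes an ordinary conv node
    with constant weights in the exported graph, so NPUs that only
    execute tensor ops can still run it on-chip instead of falling
    back to the CPU. 'valid' padding for brevity.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out
```

The same trick covers Gaussian blur (a fixed separable kernel) and frame differencing (an elementwise subtract), which is how the "classical CV as convolutional equivalents" path in vendor toolchains works.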

Fitting Vision Transformers on Sub-10 TOPS NPUs

The practical question for a 6-TOPS class NPU (RK3588, Ambarella CV5, Qualcomm QCS8250) is which ViT architecture fits at camera-frame-rate latency with acceptable accuracy. The answer depends on INT4/INT8 quantization and whether the attention layers can execute without CPU fallback.

At INT8, TinyViT-5M achieves 80.4 percent ImageNet top-1 accuracy in 5.4 MB of model weights. At INT4 weights with INT8 activations, the same model compresses to 2.7 MB with 78.8 percent accuracy — a configuration that fits entirely in the on-chip SRAM of higher-end embedded NPUs, eliminating DRAM weight reads during inference. EfficientViT-B1 reaches 79.4 percent at FP32 and 78.5 percent at INT8, with combined pruning and INT8 quantization achieving 87 percent size reduction and 76 percent energy savings compared to the FP32 baseline, at approximately 3.5 percent accuracy loss. For 2025 edge hardware targeting industrial inspection, smart camera analytics, or robotics perception, this is the deployment range that is actually achievable without a dedicated cloud-class inference board.
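The size figures follow directly from the quantization scheme: symmetric INT8 turns each 4-byte FP32 weight into one signed byte plus a shared scale, a 4× reduction before any pruning. A minimal post-training sketch — real toolchains use per-channel scales and calibration data rather than this per-tensor shortcut:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization of a weight list.

    The largest absolute weight maps to +/-127; everything else is
    rounded onto that grid. A minimal post-training sketch, not a
    production calibration flow.
    """
    m = max(abs(w) for w in weights)
    scale = m / 127.0 if m else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values; error is bounded by the scale."""
    return [v * scale for v in q]
```

The rounding error per weight is at most half a quantization step, which is why accuracy degrades gracefully at INT8 and only becomes noticeable once weights drop to INT4.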

The non-linear operator problem is the implementation constraint that does not appear in benchmarks. Most ViT benchmarks are run on GPUs where softmax, GELU, and LayerNorm are natively accelerated. On embedded NPUs these same operators either stall or fall back to the host CPU. Confirming that the target NPU supports integer-only attention — through vendor SDK documentation, not through assumed feature parity — is the first verification step in any ViT-on-NPU evaluation.

Integrating the Spiking Stage for Always-On Triggering

The most practical integration point for spiking networks in a composable embedded pipeline is not as a primary inference engine but as an always-on gating layer that decides when to wake the dense NPU. This architecture exploits the SNN's milliwatt-range power in standby while preserving the dense NPU's superior accuracy for the frames that matter.

The pipeline structure for this approach is:

  1. Always-on SNN front-end: consumes the raw sensor stream or frame differences, generates spike events only when meaningful change is detected — motion threshold exceeded, anomaly pattern detected, keyword triggered
  2. Wake signal routed to the dense NPU power domain, which transitions from clock-gated standby to active inference
  3. Dense ViT or CNN pipeline processes the frame that triggered the wake event
  4. Classical CV post-processing validates the detection against spatial priors before committing the result
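
Steps 1 through 3 can be sketched as a gating wrapper. `snn_spike_count` and `dense_infer` are placeholder callables standing in for the real neuromorphic and NPU runtimes, not actual SDK entry points:

```python
class GatedPipeline:
    """SNN-gated execution: the dense model runs only on frames whose
    spike count crosses the wake threshold.

    `snn_spike_count` and `dense_infer` are hypothetical callables
    standing in for the neuromorphic and NPU runtimes.
    """
    def __init__(self, snn_spike_count, dense_infer, wake_threshold):
        self.snn_spike_count = snn_spike_count
        self.dense_infer = dense_infer
        self.wake_threshold = wake_threshold
        self.dense_invocations = 0       # how often the NPU actually woke

    def process(self, frame):
        if self.snn_spike_count(frame) < self.wake_threshold:
            return None                  # NPU stays clock-gated
        self.dense_invocations += 1
        return self.dense_infer(frame)   # NPU wakes for this frame
```

The `dense_invocations` counter is the quantity that drives the power arithmetic: in a quiet scene it stays near zero while the SNN keeps filtering every frame.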

In this configuration the SNN is responsible for temporal filtering — discarding the majority of frames where nothing relevant is happening — and the dense ViT handles the semantic understanding of the frames that pass the filter. The SNN's 6 FPS throughput limitation and approximate 160 ms latency on the Akida v1.0 are not constraints here because its role is detection, not recognition. Recognition happens on the denser model that the SNN triggers.

The power arithmetic for this architecture is compelling. A Cortex-A SoC with a 6-TOPS NPU running at full utilization consumes 3–5 W. An Akida-class SNN running in monitoring mode consumes under 1 mW. For a product that experiences meaningful events 5 percent of the time — a surveillance camera in a quiet environment, an industrial sensor monitoring idle machinery — the energy consumed by the SNN-gated architecture is less than 10 percent of the always-on dense NPU alternative.
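The arithmetic behind that claim is a one-line duty-cycle model. The numbers below come from the paragraph above — roughly 1 mW for the SNN monitor, 4 W for the dense NPU (mid-range of the quoted 3–5 W band), events 5 percent of the time:

```python
def gated_avg_power_mw(p_snn_mw, p_dense_mw, active_fraction):
    """Average power of an SNN-gated system: the SNN monitors
    continuously while the dense NPU draws power only for the
    fraction of time events keep it awake. Ignores wake-transition
    overhead for simplicity."""
    return p_snn_mw + active_fraction * p_dense_mw

gated = gated_avg_power_mw(1.0, 4000.0, 0.05)   # 201 mW average
always_on = 4000.0                               # dense NPU never sleeps
```

At a 5 percent event rate the gated system averages about 200 mW against 4 W always-on — about 5 percent, comfortably inside the "less than 10 percent" claim, with margin left for wake-transition overhead the model ignores.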

The implementation challenge is the interface between the event-based SNN output and the frame-based dense model input. SNNs produce Address-Event Representation (AER) output — a stream of (neuron address, timestamp) pairs — which does not directly map to the spatial feature tensor that a ViT patch embedding expects. The conversion layer, whether implemented as a temporal aggregation window that accumulates spikes into a dense feature map or as a learned embedding that maps AER events directly to patch tokens, must be designed explicitly and verified against the SNN's output format.
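The temporal-aggregation variant of the interface layer is the simpler of the two. A sketch that bins AER events into a dense spike-count map, assuming addresses are raster-order pixel indices (address = y * width + x, one common AER convention but not a universal one):

```python
def aer_to_frame(events, width, height, t_start, t_end):
    """Accumulate AER (address, timestamp) events into a dense 2D
    spike-count map over one temporal window -- the simplest form of
    the AER-to-tensor interface layer. Assumes raster-order pixel
    addresses; events outside the window or the grid are dropped.
    """
    frame = [[0] * width for _ in range(height)]
    for address, t in events:
        if t_start <= t < t_end:
            y, x = divmod(address, width)
            if y < height:
                frame[y][x] += 1
    return frame
```

The resulting count map can feed a ViT patch embedding directly; the learned-embedding alternative replaces this binning with a trained projection from raw events to patch tokens.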

 

The Compiler and SDK Constraint

All of the above is only executable if the NPU's compiler and SDK support multi-model graph compilation, heterogeneous operator types, and the specific quantization modes that the mixed pipeline requires. This is where the theoretical composability of the architecture meets the practical limitations of 2025 toolchains.

The current state of embedded NPU compiler support for composable mixed pipelines is:

  • ONNX → TensorRT (NVIDIA Jetson): mature support for fused ViT graphs including integer attention approximations; no native SNN support; SNN must be implemented on a separate substrate or simulated as a sparse CNN
  • ONNX → OpenVINO (Intel): strong ViT support via OpenVINO model optimizer; limited classical CV fusion; no native SNN support
  • RKNN toolkit (Rockchip RK3588): INT4 and INT8 ViT support added in 2024; classical CV approximation via convolutional equivalents; no neuromorphic support
  • BrainChip MetaTF: supports ANN-to-SNN conversion from TensorFlow/Keras models; supports YOLOv2-equivalent SNN detection; no native ViT block support as of early 2025
  • SynSense Speck and DYNAP-CNN: event-driven processing focused on CNNs, limited transformer support

The practical implication is that a fully unified compiler that ingests a graph containing classical CV ops, ViT blocks, and SNN layers and produces a single scheduled output for a single chip does not exist as a commercial product today. Composability in 2025 is achieved by partitioning the graph across two substrates — a conventional NPU for the dense ViT/CNN stages and a neuromorphic co-processor for the SNN stage — with a well-defined interface between them, and by using a conventional NPU compiler for the dense portion with classical CV approximated as convolutional ops.
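The partitioning step itself is mechanical once each operator is assigned a substrate. A toy sketch over a topologically ordered op list — the op names are illustrative, not any vendor's actual op set:

```python
# Illustrative op categories; real toolchains consult their own
# supported-operator tables rather than a hand-written set.
SNN_OPS = {"SpikeEncode", "LIFNeuron", "SpikeRoute"}

def partition_graph(nodes):
    """Split a topologically ordered (name, op) list into the two
    compilation units described above: a dense subgraph for the
    conventional NPU compiler and a spiking subgraph for the
    neuromorphic toolchain. The AER-to-tensor interface sits at the
    boundary between the two returned lists."""
    dense, spiking = [], []
    for name, op in nodes:
        (spiking if op in SNN_OPS else dense).append((name, op))
    return dense, spiking
```

In practice the hard part is not the split but the boundary contract: the spiking unit's AER output format and the dense unit's expected input tensor must be pinned down before either side is compiled.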

This is not a theoretical barrier to building the pipeline; it is a description of the implementation architecture that works on available hardware. The degree of composability — how much operator fusion and shared intermediate representation is actually achievable — increases as compilers mature and as SoC vendors begin integrating neuromorphic IP blocks (like Akida Pico) alongside conventional NPU cores on the same die.

Quick Overview

A composable embedded AI pipeline combines classical computer vision preprocessing, dense neural inference via convolutional networks or vision transformers, and event-driven detection via spiking neural networks into a unified computation graph scheduled on the NPU as an integrated workload. The primary motivation is latency reduction through operator fusion and elimination of CPU handoffs, energy reduction through SNN-based event gating that activates the dense NPU only when needed, and improved intermediate representation sharing across pipeline stages. Current commercial toolchains require partitioning the pipeline across a conventional NPU for dense stages and a neuromorphic co-processor for SNN stages, with ONNX as the primary IR for the dense portion. Full single-chip unification is on the roadmap as neuromorphic IP blocks begin integrating alongside conventional NPU cores.

Key Applications

Industrial inspection systems combining always-on SNN anomaly detection with ViT-based semantic defect classification triggered on detected events, smart camera analytics platforms processing multiple concurrent detection and classification tasks against a shared NPU budget, ADAS perception nodes requiring both spatial scene understanding from transformer-based models and low-latency motion event detection for safety-critical triggers, wearable health monitoring devices using sub-milliwatt SNN keyword detection to gate a larger model for full inference, and robotics perception pipelines combining optical flow and feature tracking from classical CV with object recognition from a compact ViT.

Benefits

SNN-gated pipeline architectures reduce total system power by up to 90 percent compared to always-on dense NPU execution for applications where meaningful events occupy less than 10 percent of the input stream. Operator fusion across the full graph reduces end-to-end latency by eliminating CPU serialization overhead between models — typically 3–8 ms per pipeline transition on Cortex-A class processors. INT8 and INT4 quantization of ViT models produces 3–4× inference speedup on embedded NPUs with 78–80 percent ImageNet top-1 accuracy preserved, fitting models like TinyViT-5M within 5.4 MB and within on-chip SRAM of higher-end NPUs.

Challenges

Commercial NPU compilers do not yet support unified graph compilation across spiking and dense model types; the composable pipeline in 2025 requires explicit partitioning across two substrates with a designed AER-to-tensor interface layer. ViT attention operators including softmax, GELU, and LayerNorm are not natively supported by all embedded NPU MAC arrays and produce CPU fallback or pipeline stalls unless integer-only approximations are used in the model. Training SNN models for domain-specific edge tasks requires specialized frameworks — BrainChip MetaTF, Intel Lava, PyNN — that have limited interoperability with mainstream PyTorch and TensorFlow training pipelines.

Outlook

Neuromorphic IP integration alongside conventional NPU cores on production SoCs — represented by BrainChip's Akida Pico licensing model and similar initiatives — is the trajectory toward true single-die composable pipelines. As silicon vendors add neuromorphic cores to their NPU subsystems, compiler support will follow, and the AER-to-tensor interface problem will move from application code into the hardware abstraction layer. The ONNX operator set extension for event-based operators, currently in proposal stage, would enable end-to-end graph representation and compilation of mixed pipelines without vendor-specific SDK fragmentation.

Related Terms

composable AI pipeline, vision transformer, ViT, MobileViT, TinyViT, EfficientFormer, EfficientViT, spiking neural network, SNN, neuromorphic computing, BrainChip Akida, Akida Pico, Intel Loihi 2, IBM TrueNorth, Address-Event Representation, AER, NPU, neural processing unit, INT8 quantization, INT4 quantization, quantization-aware training, QAT, post-training quantization, log-int-softmax, FQ-ViT, attention mechanism, softmax fallback, ONNX, TFLite, TensorRT, OpenVINO, RKNN, MetaTF, operator fusion, MAC array, on-chip SRAM, event-driven gating, wake-up detection, classical computer vision, optical flow, ISP, image signal processor, RK3588, NXP i.MX 8M Plus, Ambarella CV5, edge inference, TinyML, mixed-precision inference, knowledge distillation

 


FAQ

What makes vision transformer attention layers hard to run on embedded NPUs?

 

The softmax normalization inside multi-head attention requires division and exponentiation across the full sequence of attention tokens, which does not map to the MAC array structure of conventional NPUs. Most embedded NPU compilers either implement softmax in a secondary processing element with associated pipeline stall, or fall back to the host CPU for that operation alone. This breaks the continuous NPU execution graph at every attention layer, adding CPU handoff latency per ViT block. Integer-only attention approximations, specifically log-int-softmax using bit-shift operators, resolve this by allowing the full attention computation to execute on the integer MAC array without dequantization to floating point.
 

What is the practical role of spiking networks in an embedded AI pipeline today?

 

The most deployable role for SNNs in 2025 embedded pipelines is as an always-on event detector that gates a denser downstream model. The SNN's input-adaptive power consumption (milliwatts when inputs are sparse, scaling with spike density as activity increases) makes it suited for continuous monitoring tasks where meaningful events are rare. It eliminates the energy cost of running a dense NPU on every frame by triggering that NPU only when the SNN detects something worth analyzing. BrainChip's Akida Pico IP operates below 1 mW in standby. Intel Loihi 2 demonstrates up to 100× energy reduction per inference versus conventional NPUs on sparse workloads. The SNN's limitations in throughput and semantic accuracy make it a poor replacement for a dense model on complex recognition tasks.
 

How does quantization affect vision transformer accuracy on embedded NPUs?

 

INT8 quantization with quantization-aware training produces 1–3 percent top-1 accuracy loss on ImageNet for most efficient ViT families relative to FP32 baseline. INT4 weight quantization with INT8 activations produces approximately 3.5 percent accuracy loss while reducing model size by a further 50 percent. TinyViT-5M reaches 78.8 percent top-1 at INT4/INT8 in 2.7 MB, small enough to fit in on-chip SRAM on higher-end NPUs, eliminating DRAM weight reads during inference and improving both latency and power. Combined pruning and INT8 quantization has been shown to achieve 87 percent size reduction and 76 percent energy savings at 3.5 percent accuracy cost, which is the relevant range for real embedded deployment decisions.
 

What is Address-Event Representation and why does it matter for composable pipelines?

 

AER is the output format of spiking neural networks: a stream of (neuron address, timestamp) pairs recording which neurons fired and when, rather than a dense activation tensor. It is efficient for transmission and storage because it encodes only the non-zero events, but it is not directly compatible with the spatial feature tensors that convolutional or transformer-based models expect as input. In a composable pipeline where an SNN front-end feeds a dense ViT back-end, an interface layer must convert AER spike accumulations into a dense spatial representation, either by binning spikes into a temporal window and projecting them onto a spatial grid, or by a learned embedding that maps events directly to patch tokens. This interface design is the key engineering problem in combining neuromorphic and transformer stages in a single pipeline.