The Cannot-Fail Core: Safety Island Architecture for Heterogeneous SoCs

The fundamental problem in safety-critical SoC design is that the most capable compute elements — GPUs processing camera feeds, NPUs running object detection models, DSPs handling radar signal processing — are rarely developed to ASIL-D. They serve mass markets where such certification investment makes no economic sense for the IP vendor. Yet these are exactly the compute elements that modern ADAS and autonomous driving systems need to perform the perception work that safety goals depend on.

The answer the industry settled on is the safety island. Rather than elevating every IP block on the SoC to ASIL-D — an approach that is both economically irrational and technically intractable for general-purpose AI accelerators — the safety island architecture concentrates ASIL-D compliance in a small, carefully designed subsystem. Everything else on the SoC can be QM or ASIL-A/B. The safety island monitors those subsystems, detects their faults, orchestrates their recovery, and holds the safety state of the overall system regardless of what happens elsewhere on the die.

Understanding how to design this architecture — what must be in the island, what can sit outside it, how ASIL decomposition distributes safety obligations across heterogeneous elements, and how the network-on-chip connects them without becoming a safety liability — is the core competency for anyone building safety-critical SoCs today.

What Makes a Safety Island Different from the Rest of the SoC

A safety island is not simply a processor with ECC on its memories. It is a subsystem designed with a specific property: it continues to function correctly under all fault conditions that affect the rest of the SoC, including faults in the NoC, faults in non-safety IP, and faults in power distribution that do not affect the island's own power domain.

Achieving this requires a set of design decisions that isolate the island from every failure mode the rest of the SoC can produce. NVIDIA's Functional Safety Island (FSI) in the Orin SoC is a production example that makes these decisions explicit:

The processor fabric is four Dual-Core Lockstep (DCLS) Cortex-R52 cores. Each DCLS pair runs identical software on two separate cores with a time offset, then compares outputs. Any divergence, whether from a soft error or a permanent hardware fault in either core, is detected at the comparison boundary and flagged before the erroneous output can propagate to an actuator command. Lockstep does not catch systematic errors that both cores share, such as a software bug, because both cores then produce the same wrong result. The Orin FSI provides approximately 10,000 ASIL-D MIPS for safety functions including sensor fusion and vehicle control.
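
The time-offset comparison can be modelled in a few lines. The following is a minimal behavioural sketch, not a description of the Orin hardware: the function name run_lockstep, the step callback standing in for one cycle of core computation, and the single-bit fault injection are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def run_lockstep(
    inputs: Iterable[int],
    step: Callable[[int], int],
    offset: int = 2,
    fault_cycle: Optional[int] = None,
) -> Optional[int]:
    """Model a delayed lockstep pair.

    The primary core processes each input immediately. The shadow core
    processes the same input `offset` cycles later, and the comparator
    checks the delayed primary output against the shadow output.
    Returns the cycle at which a divergence is flagged, or None.
    """
    values = list(inputs)
    pending = deque()  # (input, primary_output) pairs awaiting the shadow core
    for cycle in range(len(values) + offset):
        if cycle < len(values):
            out = step(values[cycle])
            if cycle == fault_cycle:
                out ^= 0x1  # hypothetical single-bit upset in the primary core
            pending.append((values[cycle], out))
        if cycle >= offset:
            value, primary_out = pending.popleft()
            if step(value) != primary_out:  # shadow core recomputes and compares
                return cycle
    return None
```

With offset=2 and a fault injected at cycle 4, this model flags the divergence at cycle 6: detection latency equals the time offset, which is the price paid for reducing exposure to disturbances that hit both cores at the same instant.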

The power domain is independent: the FSI runs on separate voltage and power rails from the rest of the SoC. A fault that causes a brownout condition on the main compute rails does not affect the FSI, which can continue executing the fail-safe state machine even as the GPU and NPU clusters go offline.

The memory is private: the island has its own tightly coupled memory (TCM) — Cortex-R52's ATCM, BTCM, CTCM — accessible only to the island's cores, with ECC protection. There is no path for a DMA operation from a QM GPU driver or a corrupted NPU inference to overwrite the island's safety state.

The connectivity is dual-path: the island has a private bus to non-volatile storage and the watchdog timer for functions that must execute without any dependency on the main NoC, and a separate controlled connection to the main system NoC for monitoring and coordination. The private path exists precisely because the main NoC cannot be assumed to be fault-free.

Dream Chip Technology's production ADAS SoC demonstrates the same architecture at a smaller scale: a dual-core lockstep Cortex-R52 with dedicated TCM, a private watchdog, a dedicated interrupt controller for fault aggregation from the rest of the SoC, and a separate bus to the main NoC. The derivatives of this platform — for active mirror replacement, front camera, and radar applications — each share the same safety island across different vision processor configurations, demonstrating the architecture's scalability across a product family without re-certifying the island for each variant.

ASIL Decomposition and What It Actually Requires

ASIL decomposition is the ISO 26262-9 mechanism that allows a single safety requirement to be split across two independently developed and sufficiently independent elements, each carrying a lower ASIL. An ASIL-D requirement can be decomposed into ASIL-B + ASIL-B (written ASIL B(D) + ASIL B(D) in the standard), an ASIL-C requirement into ASIL-A + ASIL-B, and so on. The independence requirement is strict: the two elements must not interact in unintended ways, their failure modes must not share a common cause, and the analysis of their independence (dependent failure analysis) must be performed and documented.
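
The decomposition schemes can be captured as a small lookup. This is a sketch under stated assumptions: the table lists the schemes defined in ISO 26262-9, while the function name is_valid_decomposition and the boolean independent flag (standing in for a completed dependent failure analysis) are illustrative.

```python
# Decomposition schemes from ISO 26262-9, keyed by the ASIL of the
# original requirement. Each pair is ordered from higher to lower ASIL.
ALLOWED = {
    "D": {("C", "A"), ("B", "B"), ("D", "QM")},
    "C": {("B", "A"), ("C", "QM")},
    "B": {("A", "A"), ("B", "QM")},
    "A": {("A", "QM")},
}
RANK = {"QM": 0, "A": 1, "B": 2, "C": 3, "D": 4}


def is_valid_decomposition(parent: str, ch1: str, ch2: str, independent: bool) -> bool:
    """Check whether two channel ASILs form a listed scheme for `parent`.

    `independent` is a placeholder for the outcome of the dependent
    failure analysis; without it no decomposition is acceptable.
    """
    if not independent:
        return False
    pair = tuple(sorted((ch1, ch2), key=RANK.__getitem__, reverse=True))
    return pair in ALLOWED[parent]
```

The check is deliberately strict and only matches the listed pairs exactly. A channel developed to a higher ASIL than the scheme requires would also satisfy it in practice, which this sketch does not model.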

The phrase "sufficiently independent" is the critical term. Two lockstep cores on the same die running from the same clock with the same power supply are not sufficiently independent for some common-cause failure modes: a voltage spike or a heavy-ion strike can affect both cores simultaneously, which is why Infineon's AURIX uses diverse lockstep (clock delay between paired cores) and NVIDIA's Orin FSI uses time-offset DCLS execution. The independence argument for ASIL decomposition must address all of the failure modes that could cause both channels to fail simultaneously, and the independent power domain for the safety island is part of that argument.

For heterogeneous SoCs, ASIL decomposition between the safety island and the main compute fabric is the primary mechanism for achieving system-level ASIL-D with non-ASIL-D IP. The decomposition pattern is:

Channel A: the safety island executes a safety monitor or a simplified reference calculation at ASIL-B or ASIL-D.

Channel B: the main compute fabric (GPU, NPU, Cortex-A cluster) executes the primary inference or perception function at ASIL-B or QM.

The system-level safety argument is that a failure in the main compute fabric that produces incorrect output will be detected by the safety island through comparison, plausibility checking, or timeout monitoring, and the island will command a transition to safe state before the incorrect output reaches an actuator.

The following table maps decomposition configurations to their system-level effect:

Island ASIL	Compute fabric ASIL	Decomposition	System-level coverage
ASIL-D (island alone)	QM (monitored only)	Not a decomposition; the island provides all safety	Island must catch all compute fabric faults
ASIL-D	ASIL-B	D = D (island provides full coverage; ASIL-B fabric reduces demands on the island)	Conservative; higher development cost on fabric
ASIL-B	ASIL-B	B + B = D (standard decomposition)	Must demonstrate independence; fabric must have an ASIL-B development process
ASIL-D (output monitor for a QM GPU)	QM	Island monitors GPU outputs and detects plausibility failures	Common architecture for NPU/GPU in ADAS

The last row is the most commercially common architecture in 2024–2025: a QM or ASIL-A GPU or NPU runs the perception algorithm, and the ASIL-D safety island monitors the outputs for plausibility before they are accepted into the safety-critical path. The safety argument is not that the GPU is reliable — it is that the island's monitoring catches GPU errors before they propagate. This requires the monitoring to have sufficient coverage: if the GPU can produce a class of systematic output error that the island's plausibility check does not detect, the system-level safety argument fails at that point.
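
A monitor of this kind typically combines freshness, range, and consistency checks. The sketch below is illustrative only: the Detection fields, the OutputMonitor class, and every threshold (50 ms deadline, 250 m range, 90 m/s speed) are hypothetical values chosen for the example, not figures from any production system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    SAFE_STATE = "safe_state"


@dataclass(frozen=True)
class Detection:
    frame_id: int
    timestamp_ms: float   # when the compute fabric produced the result
    distance_m: float     # distance to the tracked object
    rel_speed_mps: float  # reported relative speed


class OutputMonitor:
    """Island-side check of one tracked object reported by a QM accelerator."""

    def __init__(self, deadline_ms=50.0, max_range_m=250.0, max_speed_mps=90.0):
        self.deadline_ms = deadline_ms
        self.max_range_m = max_range_m
        self.max_speed_mps = max_speed_mps
        self._last: Optional[Detection] = None

    def check(self, det: Detection, now_ms: float) -> Verdict:
        # Timeout monitoring: a stale result is treated as a fault.
        if now_ms - det.timestamp_ms > self.deadline_ms:
            return Verdict.SAFE_STATE
        # Range checks, written so that NaN values also fail.
        if not (0.0 <= det.distance_m <= self.max_range_m):
            return Verdict.SAFE_STATE
        if not (abs(det.rel_speed_mps) <= self.max_speed_mps):
            return Verdict.SAFE_STATE
        if self._last is not None:
            # Sequence check: dropped or repeated frames are faults.
            if det.frame_id != self._last.frame_id + 1:
                return Verdict.SAFE_STATE
            dt_s = (det.timestamp_ms - self._last.timestamp_ms) / 1000.0
            if dt_s <= 0.0:
                return Verdict.SAFE_STATE
            # Consistency check: the object cannot jump faster than physics allows.
            implied_speed = abs(det.distance_m - self._last.distance_m) / dt_s
            if implied_speed > self.max_speed_mps:
                return Verdict.SAFE_STATE
        self._last = det
        return Verdict.ACCEPT
```

Checks like these catch stale, out-of-range, and physically impossible outputs. They do not catch an output that is wrong but physically consistent, which is exactly the coverage question the safety argument has to answer.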

The Network-on-Chip as Safety Infrastructure

In a heterogeneous SoC with a safety island, the network-on-chip is not a passive interconnect. It is an active participant in the safety architecture because every data path between the safety island and the monitored IP traverses it. A fault in the NoC can corrupt messages between the island and the IP it monitors, producing false "all clear" signals or blocking fault notifications. If the NoC is not itself designed to ASIL standards, the safety architecture has a hidden gap.

Arteris IP addresses this by building three functional safety mechanisms into its NoC IP: timeout checking, isolation, and end-to-end ECC protection. Timeout checking detects transient faults by identifying when a request does not receive a response within its expected time window, then generating an interrupt to the safety island. Isolation allows the NoC safety controller to disconnect power to a socket connected to a faulty IP subsystem, preventing a faulting IP from corrupting the bus fabric with bad data or continuous error traffic. End-to-end ECC protection ensures that data integrity is maintained across the full path from source to destination, so that a transient error in the interconnect fabric itself does not silently corrupt a value that the safety island will accept as valid.
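
The timeout mechanism amounts to tracking outstanding transactions against a deadline. The following is a behavioural sketch only and does not reflect the Arteris implementation: the TimeoutChecker class, its method names, and the raise_irq callback are hypothetical.

```python
from typing import Callable, Dict, Tuple


class TimeoutChecker:
    """Track outstanding NoC requests and flag those that exceed a window."""

    def __init__(self, window_cycles: int, raise_irq: Callable[[str, int], None]):
        self.window_cycles = window_cycles
        self.raise_irq = raise_irq  # stands in for the interrupt line to the island
        self.outstanding: Dict[int, Tuple[str, int]] = {}  # txn_id -> (target, issue cycle)

    def on_request(self, txn_id: int, target: str, cycle: int) -> None:
        self.outstanding[txn_id] = (target, cycle)

    def on_response(self, txn_id: int) -> None:
        # A response in time retires the transaction.
        self.outstanding.pop(txn_id, None)

    def tick(self, cycle: int) -> int:
        """Expire overdue transactions; return how many were flagged."""
        expired = [
            txn_id
            for txn_id, (_, issued) in self.outstanding.items()
            if cycle - issued > self.window_cycles
        ]
        for txn_id in expired:
            target, _ = self.outstanding.pop(txn_id)
            self.raise_irq(target, txn_id)
        return len(expired)
```

Reporting the target along with the transaction lets the island decide which socket to isolate, rather than only learning that something on the fabric stalled.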

The practical implication is that the safety island's safety case depends on the safety properties of the NoC path between the island and the monitored IP. For a system targeting ASIL-D at the system level, the NoC on the path between the safety island and ASIL-D components must itself be developed to ASIL-D or provide ASIL-D equivalent safety mechanisms. For paths between QM IP and the island, the NoC can provide ASIL-B monitoring mechanisms — enough to detect that something went wrong, even if the QM IP's failure mode is uncharacterized.

The Dream Chip ADAS SoC design uses this architecture explicitly: the safety island connects to the main Arteris NoC, which provides timeout checking and ECC monitoring for all IP subsystems connected to it. When a fault is detected in any IP subsystem — a BIST failure after isolation, a timeout on a peripheral, an ECC error on a data path — the NoC generates an interrupt to the safety island, which then decides the system response: recovery through resetting and re-testing the affected subsystem, degraded mode operation without the affected function, or fail-safe state if the fault is in a function that cannot be safely bypassed.
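
The three-way response can be sketched as a small decision function. This is an illustrative policy, not the Dream Chip implementation: the handle_fault name, the reset_and_bist callback, the bypassable set, and the retry limit are assumptions, and a real design would also bound the whole sequence by the fault-tolerant time interval.

```python
from enum import Enum, auto
from typing import Callable, Set


class Response(Enum):
    RECOVERED = auto()   # subsystem reset and passed its self-test
    DEGRADED = auto()    # continue without the affected function
    FAIL_SAFE = auto()   # function cannot be bypassed; enter safe state


def handle_fault(
    subsystem: str,
    reset_and_bist: Callable[[str], bool],
    bypassable: Set[str],
    max_attempts: int = 2,
) -> Response:
    """Decide the system response after the NoC reports a fault.

    `reset_and_bist` is a hypothetical hook that resets the subsystem,
    re-runs its built-in self-test, and returns True on a pass.
    """
    for _ in range(max_attempts):
        if reset_and_bist(subsystem):
            return Response.RECOVERED
    if subsystem in bypassable:
        return Response.DEGRADED
    return Response.FAIL_SAFE
```

Keeping the set of bypassable functions as explicit data makes the policy reviewable: the question of which functions may be dropped is answered in one place instead of being scattered through handler code.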

Safety-Ready AI Accelerators and the Remaining Gap

The commercial availability of ASIL-rated AI accelerator IP is closing the gap that previously forced safety architects to treat all neural network processing as QM. Synopsys' ARC NPX6FS NPU IP is offered at ASIL-B or ASIL-D Ready certification levels, with safety features that include dual-core lockstep, a self-checking safety monitor, a windowed watchdog timer, diagnostic error injection, and error classification. The safety documentation package (safety manual, FMEDA, DFMEA) reduces the effort for SoC integrators to establish an ASIL certification argument that includes the NPU in the safety chain rather than treating it as a QM subsystem monitored by the safety island.

This changes the decomposition options available to the architect. An NPX6FS running at ASIL-B does not require the safety island to provide full coverage for NPU errors — the NPU itself provides ASIL-B diagnostic coverage, and the island's monitoring of NPU outputs provides the second ASIL-B channel for a combined ASIL-D argument. The island's monitoring load is reduced, the NPU's certification documentation is already produced by the IP vendor, and the SoC integrator's FMEDA work for the AI inference path becomes an integration exercise rather than a full hardware safety analysis from scratch.

The remaining gap in 2025 is GPU safety certification. GPU IP from major vendors — the Mali GPU series, NVIDIA's Ampere/Ada clusters — is not offered with ASIL-D certifications because the GPU's complex, branching execution model makes DCLS implementation impractical and formal FMEDA analysis extremely difficult given the scale of the IP. The safety architecture for any SoC that uses a conventional GPU in its safety-relevant path therefore must treat the GPU as a QM or ASIL-A subsystem and rely on the safety island's output monitoring to catch GPU errors. This is architecturally feasible but it places demands on the monitoring strategy: the island must implement plausibility checks that are both fast enough not to introduce unacceptable latency in the safety-critical decision path and broad enough to catch the classes of GPU error that can produce plausible but incorrect output values.

Chiplet Architectures and Safety Island Extension

The growing adoption of chiplet-based packaging — where compute fabric, memory, and safety elements are implemented as separate dies connected through die-to-die interconnects — creates new safety architecture considerations that the monolithic SoC safety island model does not directly address.

The research on safety island extension to chiplet platforms identifies three structural challenges. First, the die-to-die interconnect (UCIe, AIB, or similar) introduces a new potential fault location that is not within any single die's safety analysis. A transient fault on the die-to-die link between the safety island's die and the main compute die produces the same effect as a NoC fault: the island may receive corrupted fault notifications or fail to receive them at all. The link must be protected with end-to-end integrity mechanisms equivalent to those required of the on-die NoC.
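
One common way to obtain end-to-end integrity on such a link is to wrap each message with a sequence counter and a checksum. The sketch below illustrates that pattern and is not taken from UCIe, AIB, or any cited design: the frame layout, the pack_frame and unpack_frame names, and the use of CRC-32 are assumptions. CRC-32 only detects corruption here; it does not correct it as ECC would.

```python
import struct
import zlib
from typing import Optional

# Hypothetical frame layout: 16-bit sequence counter, 8-bit payload length,
# payload bytes, then a 32-bit CRC over everything before it.
HEADER = struct.Struct(">HB")
CRC = struct.Struct(">I")


def pack_frame(seq: int, payload: bytes) -> bytes:
    """Build a frame; payloads longer than 255 bytes raise struct.error."""
    header = HEADER.pack(seq & 0xFFFF, len(payload))
    return header + payload + CRC.pack(zlib.crc32(header + payload))


def unpack_frame(frame: bytes, expected_seq: int) -> Optional[bytes]:
    """Return the payload, or None if the frame is truncated, corrupted,
    or out of sequence (a lost or repeated message)."""
    if len(frame) < HEADER.size + CRC.size:
        return None
    body, crc_bytes = frame[:-CRC.size], frame[-CRC.size:]
    if zlib.crc32(body) != CRC.unpack(crc_bytes)[0]:
        return None
    seq, length = HEADER.unpack(body[:HEADER.size])
    payload = body[HEADER.size:]
    if length != len(payload) or seq != (expected_seq & 0xFFFF):
        return None
    return payload
```

The sequence counter addresses the second failure described above: a fault notification that never arrives shows up as a gap in the counter on the next frame, rather than passing silently.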

Second, independent power domains that are trivially achievable on a monolithic SoC become more complex in a chiplet package where multiple dies share a substrate and potentially share power delivery infrastructure. The safety island die's independent power domain argument must account for common-mode power failures in the package, not just on the die.

Third, ASIL decomposition across dies requires demonstrating independence of the decomposed channels at the die level, including independence from common-cause failures in the packaging and die-to-die interface. Using an Infineon AURIX as an external safety host controller for an HPC die, a configuration described in research work on safety islands for HPC platforms, represents the extreme end of this architecture: the safety island is physically and electrically separate from the compute die and connected via a standard serial interface (SPI, UART, or CAN FD). This provides the maximum achievable independence for the safety argument at the cost of higher communication latency between island and monitored compute.

Quick Overview

The safety island architecture concentrates ASIL-D compliance in a small, physically isolated subsystem (typically DCLS Cortex-R52 cores with independent power, private TCM, and a dedicated watchdog) while allowing the majority of the SoC's compute fabric (GPU, NPU, Cortex-A cluster) to be developed at QM or ASIL-A/B. ASIL decomposition across the safety island and the monitored compute distributes the safety obligation: either the island provides full ASIL-D monitoring of QM compute outputs, or a certified ASIL-B accelerator and the island together provide ASIL-B + ASIL-B = ASIL-D coverage. The NoC is not a passive element in this architecture: it must provide ECC, timeout checking, and IP isolation to avoid becoming the hidden fault location that breaks the safety case. NVIDIA's Orin FSI (four DCLS Cortex-R52 CPUs, independent power, approximately 10K ASIL-D MIPS) is a production implementation of these principles, and an Infineon AURIX used as an external safety host applies them at a different integration scale.

Key Applications

ADAS SoCs combining QM GPU camera perception with ASIL-D safety island output monitoring for lane keeping, emergency braking, and obstacle detection; industrial machinery control SoCs with ASIL-C NPU-based anomaly detection monitored by a safety island for fail-safe actuator command; radar SoC designs where DSP-based signal processing is validated by a lockstep safety monitor before target classification reaches the braking system; domain controller architectures consolidating multiple formerly separate ECUs onto a single SoC with a safety island managing the cross-domain safety state; and chiplet platforms where the safety island is implemented as a physically separate compute die from the main HPC compute fabric.

Benefits

Safety island architecture enables the use of commercial GPU and NPU IP — developed for cost-effectiveness in mass markets rather than for safety certification — in safety-critical SoCs, by isolating the safety obligation in a small certified core rather than propagating ASIL-D requirements to every IP block. Derivative SoCs with different compute configurations (different GPU, different NPU, different core count) can share the same certified safety island, amortizing the ASIL-D certification investment across the product family. ASIL-B certified AI accelerator IP from vendors like Synopsys reduces both the monitoring demand on the island and the integrator's FMEDA work for the AI inference channel.

Challenges

The safety island's plausibility monitoring of QM GPU or NPU outputs must cover the classes of systematic error that the compute element can produce — not just random hardware faults. A GPU that produces geometrically plausible but semantically incorrect object detection output (a pedestrian classified as a roadside barrier) may pass all plausibility checks and still represent a safety-critical failure. The monitoring strategy must be designed against the actual failure modes of the monitored computation, not against a generic fault model. Chiplet architectures extend the safety isolation challenge to die-to-die interconnects and shared packaging infrastructure, requiring updated independence arguments that the monolithic SoC safety island model was not designed to address.

Outlook

Zonal E/E architectures in software-defined vehicles are increasing the compute integration demand that the safety island model must support: a single zone controller may need to host ASIL-D brake control, ASIL-B ADAS, and QM infotainment, requiring a safety island that manages three distinct criticality levels simultaneously. The AURIX TC4xx generation, operating at 500 MHz with full lockstep and CPU virtualization support, demonstrates that safety island processors are gaining the performance to host safety-critical software stacks that previously required dedicated ECUs. Safety-certified NPU IP at ASIL-B and ASIL-D Ready levels, with comprehensive FMEDA documentation, is the trajectory that will gradually close the AI accelerator gap and reduce reliance on pure plausibility monitoring for AI inference outputs in safety paths.

Related Terms

safety island, ASIL decomposition, ASIL-D, ASIL-B, ISO 26262, heterogeneous SoC, dual-core lockstep, DCLS, Cortex-R52, tightly coupled memory, TCM, ECC, independent power domain, watchdog timer, fault aggregation, NoC, network on chip, ECC protection, timeout checking, IP isolation, FMEDA, SPFM, LFM, PMHF, QM, freedom from interference, safety host controller, NVIDIA Orin FSI, Infineon AURIX, ARC EM Safety Island, ARC NPX6FS, Synopsys DesignWare, Arteris IP, Dream Chip Technology, GPU safety, NPU safety, ASIL-B NPU, plausibility monitoring, fail-safe state, safety element out of context, SEooC, die-to-die interconnect, UCIe, chiplet safety, zonal architecture, software-defined vehicle, domain controller, ADAS, dependent failure analysis, common cause failure, diverse lockstep, safety management unit, LBIST, MBIST

FAQ

Why can a GPU or general-purpose NPU not simply be designed to ASIL-D without a safety island?

What must a safety island contain at minimum to support ASIL-D system-level claims?

What is the role of end-to-end ECC in the NoC for the safety island argument?

How does ASIL decomposition change when a certified ASIL-B NPU is available instead of a QM NPU?