Shift Left for Firmware: How FPGA Emulation Compresses Linux Bring-Up Before Tape-Out
Silicon bring-up used to be a sequential process: the design team tapes out, fabrication takes three to six months, the first packaged part arrives on the lab bench, and only then does the firmware team start discovering that drivers are broken, the boot sequence hangs, and the device tree does not match what actually got implemented. At this point every firmware bug is found at maximum cost — the hardware cannot be changed without another tape-out, debug visibility is limited to what the silicon itself exposes, and the product schedule slips by however many weeks it takes to chase issues that could have been found months earlier.
The shift-left answer to this is running the full firmware stack — bootloader, kernel, device drivers, application layer — on an FPGA prototype of the SoC RTL before the design goes to the foundry. The FPGA does not replace the chip; it runs the same register-transfer level description that will eventually be fabricated, at a fraction of the clock speed, on reconfigurable logic that can be modified when bugs are found. What the firmware team gets is a platform that behaves like the target silicon — same memory map, same peripheral registers, same interrupt topology — months before first silicon arrives.
Amlogic's engineering team, presenting at DAC, described exactly this trajectory: they needed to boot simple Linux kernels in minutes and full Android in hours on their pre-silicon platform, with support for UART and JTAG debug and the ability for multiple engineers to connect remotely. The Palladium emulation platform achieved 1 MHz execution speed and the Protium FPGA platform reached 5 MHz in fully automatic mode — both fast enough for OS-level software development at a fraction of the clock speed of production silicon, but running the actual chip RTL rather than a behavioral model.
Three Pre-Silicon Platforms and Their Tradeoffs
Pre-silicon software development runs on three distinct categories of platform, each at a different point in the accuracy-versus-speed tradeoff:
Virtual platforms (QEMU, Arm Fast Models) are instruction-accurate software simulators. They model the processor ISA and a set of peripheral behaviors in software running on the host CPU. QEMU can boot Linux for ARM, RISC-V, and other targets at close to native host speed because it translates target instructions to host instructions via just-in-time compilation. What it does not provide is cycle accuracy or RTL fidelity: the peripheral models are hand-written approximations of register behavior, timing does not match actual hardware, and bugs that arise from actual RTL implementation details — pipeline behavior, cache coherency interactions, bus arbitration timing — are invisible. Virtual platforms are valuable for early application software development before hardware design is complete, but they cannot find firmware bugs that arise from the real hardware implementation.
Hardware emulation platforms (Cadence Palladium, Siemens Veloce, Synopsys ZeBu) map the SoC RTL directly onto a custom emulation fabric that executes hardware logic at 1–5 MHz clock rates. These platforms achieve full RTL accuracy — the firmware sees exactly the same register behavior and timing as the eventual silicon — and provide extensive debug visibility including the ability to probe any internal signal, save and restore system state, and replay execution. They are expensive infrastructure (emulation systems cost millions of dollars), typically shared across multiple projects in a central EDA facility, and used for hardware verification as well as software bring-up. The practical speed of 1–5 MHz is sufficient to boot minimal Linux in minutes and run driver regression tests, but full Android boot takes hours.
FPGA prototype platforms (Cadence Protium, Synopsys HAPS, AMD Xilinx-based custom boards) map the SoC RTL onto one or more commercial FPGAs. They run at 5–50 MHz — substantially faster than emulators — and achieve RTL accuracy comparable to emulation, though with less debug visibility and more initial setup effort. Protium leverages the AMD VP1902 Versal Adaptive SoC, the industry's largest production FPGA, to support designs above 1 billion gates. The compilation flow is automated; Cadence's Protium can be brought up from a Palladium emulation design in a few days using the shared compilation flow. At 5–50 MHz, full Linux boot is achievable in minutes and Android boot in tens of minutes, enabling a workflow where firmware developers iterate with something close to real interactive development speed.
The following table compares the platforms across the dimensions that matter most for firmware bring-up:
| Platform | Speed | RTL fidelity | Debug visibility | Cost model | Best use |
|---|---|---|---|---|---|
| QEMU / Fast Models | Near-native host speed | ISA-accurate, peripheral models approximate | Software debug only | Free / low | Early driver scaffolding, application dev |
| Hardware emulator (Palladium) | 1–5 MHz | Full RTL | Full signal probe, state save/restore | Shared facility, expensive | Hardware-software co-debug, regression |
| FPGA prototype (Protium, HAPS) | 5–50 MHz | Full RTL | Limited, JTAG + UART | High upfront, accessible | Linux/Android bring-up, driver iteration |
| First silicon | 1+ GHz | Ground truth | Post-silicon debug tools | Tape-out cost + schedule | Validation, production qualification |
What the FPGA Prototype Runs
Mapping a custom SoC design onto an FPGA for pre-silicon firmware bring-up requires the design team to synthesize the RTL — the same files that will go to the foundry — into FPGA-specific configuration bitstreams. This is distinct from designing an FPGA-based product: the goal is not to optimize for FPGA implementation but to faithfully replicate the register and bus behavior of the target ASIC on available FPGA fabric.
Several constraints make this non-trivial. ASIC designs use memory primitives (SRAM macros, DDR PHY) that do not directly map to FPGA block RAM. The ASIC's clock frequencies, often 1–2 GHz, must be divided down to frequencies achievable by the FPGA implementation — typically 10–50 MHz for a complex SoC. Physical-only design elements — analog blocks, PLLs, I/O PHYs — must be stubbed out with behavioral models or replaced by FPGA-native equivalents. For multi-billion-gate SoCs that exceed any single FPGA's capacity, the design must be partitioned across multiple FPGAs with high-bandwidth interconnects between them.
The Barcelona Supercomputing Center's Makinote platform demonstrates the scale this requires for large designs: 96 AMD Xilinx Alveo U55C FPGAs interconnected to emulate up to 750 million ASIC cells, with specialized PCIe Gen4 and HBM interconnects between them. Their FPGA shell handles the inter-FPGA connectivity automatically, allowing design teams to port RTL to the cluster with minimal manual wiring work. At 32 FPGAs they demonstrated 8× performance improvement over a single FPGA for HPC workloads, and the platform is fast enough to boot Linux on a RISC-V SoC and run application-level benchmarks.
FireSim, the open-source FPGA-accelerated simulation framework from UC Berkeley, takes a different approach oriented toward research and commercial RISC-V SoC development. It runs cycle-accurate RTL simulations at 10–100 MHz on cloud FPGA instances (Amazon EC2 F1, Xilinx Alveo U250/U280), generating hardware models directly from synthesizable Chisel or Verilog RTL, and has been used in the development of commercially-available silicon. A 1024-node cluster simulation on 256 cloud FPGAs, each node running RISC-V RTL with a complete memory system and 200 Gbit/s Ethernet NIC model, demonstrates that the approach scales to full datacenter simulation — each node capable of booting Linux and running memcached at speeds that allow meaningful system-level performance measurements.
The Linux Bring-Up Workflow on FPGA
Linux bring-up on an FPGA prototype follows the same sequence as bring-up on first silicon, but with the ability to iterate that first silicon does not provide. The sequence is:
First stage bootloader (SPL / ATF BL2): the very first code that runs after reset, executing from on-chip ROM or the beginning of a boot device. It initializes DDR DRAM timing, configures clocks, and loads the secondary bootloader. On FPGA this stage must be adapted to match the FPGA's DDR controller model rather than the ASIC's DDR PHY, which is the most common source of early boot failures.
U-Boot: the secondary bootloader that initializes remaining peripherals, sets up device tree, loads the kernel image, and passes control to it. U-Boot's device tree-based peripheral discovery means that peripheral register addresses and interrupt numbers must exactly match the RTL implementation — mismatches here produce silent failures where peripherals initialize without error but do not function correctly.
Device tree generation from RTL handoff: Intel/Altera's SoC FPGA GSRD workflow generates device tree parameters directly from the Quartus hardware project handoff data, ensuring that the Linux device tree for the FPGA prototype matches the actual peripheral implementation. This is the mechanism that eliminates the class of firmware bug where the software was written against a specification document that diverged from what the RTL team actually implemented.
Linux kernel and driver bring-up: once U-Boot passes control to the kernel, driver initialization begins. Each peripheral driver writes to and reads from the register addresses described in the device tree, and any mismatch between the driver's expected register behavior and the actual RTL behavior manifests as a driver probe failure, an oops, or incorrect device behavior. On the FPGA prototype this failure is debuggable: the emulator or FPGA platform can probe internal signals at the point of failure, isolate whether the fault is in the driver or the hardware, and the hardware-side fix requires only an RTL change and FPGA recompile rather than a silicon re-spin.
The Amlogic case captures why this workflow compresses schedule: their pre-silicon platform let them run software driver debug and verification, verify driver functionality against actual RTL, and parallelize verification of different software drivers by partitioning the chip design across multiple emulation sessions — all before tape-out. The software team did not wait for silicon; they found and fixed hardware-software interface bugs while the hardware team still had the ability to correct the RTL.
Specific Classes of Bug That FPGA Bring-Up Finds
The value of pre-silicon FPGA bring-up is not generic "finding bugs earlier" — it is finding specific categories of bugs that only manifest when firmware actually executes against hardware registers. Understanding which bug categories these are clarifies why virtual platform development alone is insufficient.
Register reset-value bugs: many peripherals have specific expected reset values for their configuration registers that drivers read and validate before initialization. When the RTL implementation has a reset value that differs from the specification that the driver was written against, the driver either misidentifies the peripheral version or skips initialization steps it believes are unnecessary. These bugs require actual RTL execution to manifest — they are invisible in QEMU because the peripheral model was written from the same specification document as the driver.
Memory-mapped I/O ordering bugs: the ARM architecture allows memory accesses to be reordered unless explicit barriers are inserted. When firmware writes to a peripheral register and immediately reads a status register to confirm the operation, a missing or misplaced memory barrier in the driver lets the CPU reorder the transactions, so the status read completes before the write has reached the peripheral. This is a well-known class of embedded Linux driver bug that only appears on real hardware or RTL-accurate simulation — QEMU serializes all peripheral accesses.
DMA coherency bugs: when a DMA engine writes to memory and the CPU subsequently reads that memory, the CPU's cache may return a stale value if the cache was not properly invalidated before the DMA transfer. Finding the exact point in the DMA controller RTL that requires the cache flush, and confirming the driver's cache maintenance sequence, requires running the actual DMA transactions against RTL-level hardware.
Interrupt topology mismatches: the interrupt controller configuration — which interrupt line each peripheral connects to, the active level or edge polarity, the priority configuration — must exactly match the device tree and driver configuration. A single GIC register misconfiguration produces symptoms ranging from no interrupt delivery to spurious interrupts that destabilize the system. These bugs are found by running the actual interrupt controller RTL with firmware that exercises interrupt-driven peripheral operation.
RISC-V and the Democratization of Pre-Silicon Bring-Up
Custom silicon development was historically the province of large semiconductor companies that could afford both tape-out costs and the expensive EDA infrastructure required for pre-silicon validation. The emergence of open-source RISC-V SoC design frameworks, open-source FPGA emulation tools, and accessible FPGA hardware is extending pre-silicon firmware bring-up to smaller design teams and to organizations building custom silicon for the first time.
Chipyard, the open-source RISC-V SoC construction framework from UC Berkeley, provides a complete path from RTL composition through FPGA emulation to ASIC tapeout, sharing the same RTL across all three execution environments. A custom SoC described in Chipyard can be simulated in Verilator for RTL verification, mapped to FireSim for FPGA-accelerated full-system simulation with Linux boot, and submitted to an open shuttle program (Google/efabless OpenMPW) for ASIC fabrication — all from the same design source. The Makinote cluster at BSC-CNS validated a RISC-V processor design by running HPC Challenge benchmarks on the emulated RTL running on 32 FPGAs, demonstrating that at-scale pre-silicon validation is achievable without proprietary emulation infrastructure.
The practical implication for embedded product teams building custom SoCs on RISC-V is that FPGA-based pre-silicon firmware bring-up is now accessible at a cost point significantly below the Cadence Palladium/Protium infrastructure tier. A development team with access to several Xilinx Alveo boards or cloud FPGA instances can run FireSim-based pre-silicon bring-up, boot Linux on their custom RTL, and iterate on BSP and driver code before tape-out, using the same open-source tools that academic groups have validated on commercially-taped-out designs.
On ARM-based custom SoCs, Intel/Altera's SoC FPGAs (Cyclone V, Arria 10, Agilex 7) provide an integrated path: the Arm Cortex-A hard processor system (HPS) on these devices executes the actual firmware, while the FPGA fabric implements the custom IP blocks from the design under development. The GSRD (Golden System Reference Design) and associated Yocto BSP tooling provide the reference bring-up environment. This is a less flexible emulation model than a pure-FPGA RTL implementation — the processor cores are the FPGA's hard cores rather than the design's target cores — but it is highly accessible and supports complete Linux boot with Yocto-generated images, JTAG debug via Arm Development Studio, and direct validation of custom peripheral IP.
Quick Overview
Pre-silicon firmware emulation on FPGA allows Linux kernel, bootloader, and device drivers to run against the actual SoC RTL before tape-out — finding register-level bugs, DMA coherency issues, interrupt topology mismatches, and device tree errors while the hardware is still modifiable. Hardware emulators (Cadence Palladium) execute RTL at 1–5 MHz with full signal visibility; FPGA prototype platforms (Cadence Protium, FireSim) execute at 5–50 MHz with sufficient speed for iterative Linux and Android bring-up. Neither replaces QEMU for early application development, and neither replaces first silicon for full-speed validation — they fill the critical gap where RTL-accurate firmware validation at software-development speed is required. The shift-left result: firmware bugs that previously required months-long silicon re-spin cycles to fix are found and corrected while the design is still RTL, before tape-out.
Key Applications
Custom SoC development for automotive, industrial, and broadcasting applications, where silicon tape-out represents a $500K–$5M commitment and firmware readiness on day one of silicon availability is a commercial requirement; RISC-V custom processor designs using Chipyard and FireSim to validate BSP and driver code before shuttle tape-out; embedded Linux product teams building application processors with custom peripheral IP, where driver-hardware interface bugs are the primary post-silicon schedule risk; and any program where the software and hardware teams must develop concurrently rather than sequentially to meet product schedule.
Benefits
Firmware bugs found on FPGA before tape-out cost orders of magnitude less to fix than bugs found after first silicon: the FPGA fix is an RTL change and bitstream recompile; the post-silicon fix is a full re-spin. Pre-silicon FPGA bring-up allows firmware and hardware teams to work in parallel rather than sequentially, eliminating the months-long wait between tape-out and firmware team productivity. At 5–50 MHz on FPGA prototype platforms, Linux boots in minutes and drivers can be iterated at development pace. FPGA emulation accurately exposes register reset value bugs, memory-mapped I/O ordering requirements, and DMA coherency sequences that virtual platforms built from specification documents cannot reveal.
Challenges
Synthesizing production SoC RTL for FPGA requires adapting ASIC-specific memory macros (SRAM, DDR PHY), analog blocks, and high-speed I/O to FPGA-compatible equivalents, which is a non-trivial engineering effort that requires both RTL expertise and FPGA implementation knowledge. Multi-billion-gate SoCs that exceed a single FPGA's capacity require multi-FPGA partitioning with high-bandwidth inter-FPGA interconnects, adding compilation complexity and sometimes requiring manual timing closure work. The 5–50 MHz execution speed is sufficient for Linux bring-up but impractical for performance characterization or for software that has real-time requirements — those activities still require first silicon.
Outlook
Cloud FPGA instances (Amazon EC2 F1, equivalent offerings) are making pre-silicon FPGA bring-up accessible to teams that cannot justify dedicated FPGA prototype hardware, by converting a capital expense into a per-hour operational cost. The open-source FireSim and Chipyard ecosystem is extending this capability to organizations building custom RISC-V silicon that previously had no pre-silicon firmware validation path. Enterprise prototyping systems are evolving to support multi-billion-gate designs — Cadence Protium X3 targets the AMD VP1902 Versal Adaptive SoC, the largest commercially available FPGA — enabling pre-silicon bring-up of the most complex SoC designs. The shift-left principle, already established in hardware verification, is completing its extension to firmware and OS-level software development.
Related Terms
pre-silicon firmware emulation, FPGA prototype, hardware emulation, shift left, RTL, tape-out, silicon bring-up, BSP, board support package, device tree, DTS, U-Boot, SPL, ATF BL2, Linux kernel bring-up, driver probe, DMA coherency, memory-mapped I/O, MMIO ordering, interrupt controller, GIC, reset value, QEMU, virtual platform, Fast Models, instruction-accurate simulation, Cadence Palladium, Cadence Protium, Synopsys HAPS, Synopsys ZeBu, FireSim, Chipyard, RISC-V, Makinote, AMD VP1902 Versal, Xilinx Alveo, Amazon EC2 F1, Yocto, PetaLinux, GSRD, JTAG, Arm Development Studio, multi-FPGA partition, ASIC cell, SoC, clock gating, SRAM macro, DDR PHY stub, behavioral model, hardware-software co-verification
FAQ
What is the difference between hardware emulation and FPGA prototyping for pre-silicon firmware bring-up?
Why can QEMU not replace FPGA bring-up for firmware validation?
What specifically happens during Linux bring-up on an FPGA prototype?
How does FireSim enable pre-silicon Linux bring-up for custom RISC-V SoCs?