Shift Left for Firmware: How FPGA Emulation Compresses Linux Bring-Up Before Tape-Out
Silicon bring-up used to be a sequential process: the design team tapes out, fabrication takes three to six months, the first packaged part arrives on the lab bench, and only then does the firmware team start discovering that drivers are broken, the boot sequence hangs, and the device tree does not match what actually got implemented. At this point every firmware bug is found at maximum cost — the hardware cannot be changed without another tape-out, debug visibility is limited to what the silicon itself exposes, and the product schedule slips by however many weeks it takes to chase issues that could have been found months earlier.
The shift-left answer to this is running the full firmware stack — bootloader, kernel, device drivers, application layer — on an FPGA prototype of the SoC RTL before the design goes to the foundry. The FPGA does not replace the chip; it runs the same register-transfer level description that will eventually be fabricated, at a fraction of the clock speed, on reconfigurable logic that can be modified when bugs are found. What the firmware team gets is a platform that behaves like the target silicon — same memory map, same peripheral registers, same interrupt topology — months before first silicon arrives.
Amlogic's engineering team, presenting at DAC, described exactly this trajectory: they needed to boot simple Linux kernels in minutes and full Android in hours on their pre-silicon platform, with support for UART and JTAG debug and the ability for multiple engineers to connect remotely. The Palladium emulation platform achieved 1 MHz execution speed and the Protium FPGA platform reached 5 MHz in fully automatic mode — both fast enough for OS-level software development at a fraction of the clock speed of production silicon, but running the actual chip RTL rather than a behavioral model.
Three Pre-Silicon Platforms and Their Tradeoffs
Pre-silicon software development runs on three distinct categories of platform, each at a different point in the accuracy-versus-speed tradeoff:
Virtual platforms (QEMU, Arm Fast Models) are instruction-accurate software simulators. They model the processor ISA and a set of peripheral behaviors in software running on the host CPU. QEMU can boot Linux for ARM, RISC-V, and other targets at close to native host speed because it translates target instructions to host instructions via just-in-time compilation. What it does not provide is cycle accuracy or RTL fidelity: the peripheral models are hand-written approximations of register behavior, timing does not match actual hardware, and bugs that arise from actual RTL implementation details — pipeline behavior, cache coherency interactions, bus arbitration timing — are invisible. Virtual platforms are valuable for early application software development before hardware design is complete, but they cannot find firmware bugs that arise from the real hardware implementation.
Hardware emulation platforms (Cadence Palladium, Siemens Veloce, Synopsys ZeBu) map the SoC RTL directly onto a custom emulation fabric that executes hardware logic at 1–5 MHz clock rates. These platforms achieve full RTL accuracy — the firmware sees exactly the same register behavior and timing as the eventual silicon — and provide extensive debug visibility including the ability to probe any internal signal, save and restore system state, and replay execution. They are expensive infrastructure (emulation systems cost millions of dollars), typically shared across multiple projects in a central EDA facility, and used for hardware verification as well as software bring-up. The practical speed of 1–5 MHz is sufficient to boot minimal Linux in minutes and run driver regression tests, but full Android boot takes hours.
FPGA prototype platforms (Cadence Protium, Synopsys HAPS, AMD Xilinx-based custom boards) map the SoC RTL onto one or more commercial FPGAs. They run at 5–50 MHz — substantially faster than emulators — and achieve RTL accuracy comparable to emulation, though with less debug visibility and more initial setup effort. Protium leverages the AMD VP1902 Versal Adaptive SoC, the industry's largest production FPGA, to support designs above 1 billion gates. The compilation flow is automated; Cadence's Protium can be brought up from a Palladium emulation design in a few days using the shared compilation flow. At 5–50 MHz, full Linux boot is achievable in minutes and Android boot in tens of minutes, enabling a workflow where firmware developers iterate with something close to real interactive development speed.
The following table compares the platforms across the dimensions that matter most for firmware bring-up:
| Platform | Speed | RTL fidelity | Debug visibility | Cost model | Best use |
|---|---|---|---|---|---|
| QEMU / Fast Models | Near-native host speed | ISA-accurate, peripheral models approximate | Software debug only | Free / low | Early driver scaffolding, application dev |
| Hardware emulator (Palladium) | 1–5 MHz | Full RTL | Full signal probe, state save/restore | Shared facility, expensive | Hardware-software co-debug, regression |
| FPGA prototype (Protium, HAPS) | 5–50 MHz | Full RTL | Limited, JTAG + UART | High upfront, accessible | Linux/Android bring-up, driver iteration |
| First silicon | 1+ GHz | Ground truth | Post-silicon debug tools | Tape-out cost + schedule | Validation, production qualification |
What the FPGA Prototype Runs
Mapping a custom SoC design onto an FPGA for pre-silicon firmware bring-up requires the design team to synthesize the RTL — the same files that will go to the foundry — into FPGA-specific configuration bitstreams. This is distinct from designing an FPGA-based product: the goal is not to optimize for FPGA implementation but to faithfully replicate the register and bus behavior of the target ASIC on available FPGA fabric.
Several constraints make this non-trivial. ASIC designs use memory primitives (SRAM macros, DDR PHY) that do not directly map to FPGA block RAM. The ASIC's clock frequencies, often 1–2 GHz, must be divided down to frequencies achievable by the FPGA implementation — typically 10–50 MHz for a complex SoC. Physical-only design elements — analog blocks, PLLs, I/O PHYs — must be stubbed out with behavioral models or replaced by FPGA-native equivalents. For multi-billion-gate SoCs that exceed any single FPGA's capacity, the design must be partitioned across multiple FPGAs with high-bandwidth interconnects between them.
The Barcelona Supercomputing Center's Makinote platform demonstrates the scale this requires for large designs: 96 AMD Xilinx Alveo U55C FPGAs interconnected to emulate up to 750 million ASIC cells, with specialized PCIe Gen4 and HBM interconnects between them. Their FPGA shell handles the inter-FPGA connectivity automatically, allowing design teams to port RTL to the cluster with minimal manual wiring work. At 32 FPGAs they demonstrated 8× performance improvement over a single FPGA for HPC workloads, and the platform is fast enough to boot Linux on a RISC-V SoC and run application-level benchmarks.
FireSim, the open-source FPGA-accelerated simulation framework from UC Berkeley, takes a different approach oriented toward research and commercial RISC-V SoC development. It runs cycle-accurate RTL simulations at 10–100 MHz on cloud FPGA instances (Amazon EC2 F1, Xilinx Alveo U250/U280), generating hardware models directly from synthesizable Chisel or Verilog RTL, and has been used in the development of commercially-available silicon. A 1024-node cluster simulation on 256 cloud FPGAs, each node running RISC-V RTL with a complete memory system and 200 Gbit/s Ethernet NIC model, demonstrates that the approach scales to full datacenter simulation — each node capable of booting Linux and running memcached at speeds that allow meaningful system-level performance measurements.
The Linux Bring-Up Workflow on FPGA
Linux bring-up on an FPGA prototype follows the same sequence as bring-up on first silicon, but with the ability to iterate that first silicon does not provide. The sequence is:
First stage bootloader (SPL / ATF BL2): the very first code that runs after reset, executing from on-chip ROM or the beginning of a boot device. It initializes DDR DRAM timing, configures clocks, and loads the secondary bootloader. On FPGA this stage must be adapted to match the FPGA's DDR controller model rather than the ASIC's DDR PHY, which is the most common source of early boot failures.
U-Boot: the secondary bootloader that initializes remaining peripherals, sets up device tree, loads the kernel image, and passes control to it. U-Boot's device tree-based peripheral discovery means that peripheral register addresses and interrupt numbers must exactly match the RTL implementation — mismatches here produce silent failures where peripherals initialize without error but do not function correctly.
Device tree generation from RTL handoff: Intel/Altera's SoC FPGA GSRD workflow generates device tree parameters directly from the Quartus hardware project handoff data, ensuring that the Linux device tree for the FPGA prototype matches the actual peripheral implementation. This is the mechanism that eliminates the class of firmware bug where the software was written against a specification document that diverged from what the RTL team actually implemented.
Linux kernel and driver bring-up: once U-Boot passes control to the kernel, driver initialization begins. Each peripheral driver writes to and reads from the register addresses described in the device tree, and any mismatch between the driver's expected register behavior and the actual RTL behavior manifests as a driver probe failure, an oops, or incorrect device behavior. On the FPGA prototype this failure is debuggable: the emulator or FPGA platform can probe internal signals at the point of failure, isolate whether the fault is in the driver or the hardware, and the hardware-side fix requires only an RTL change and FPGA recompile rather than a silicon re-spin.
The Amlogic case captures why this workflow compresses schedule: their pre-silicon platform let them run software driver debug and verification, verify driver functionality against actual RTL, and parallelize verification of different software drivers by partitioning the chip design across multiple emulation sessions — all before tape-out. The software team did not wait for silicon; they found and fixed hardware-software interface bugs while the hardware team still had the ability to correct the RTL.
Specific Classes of Bug That FPGA Bring-Up Finds
The value of pre-silicon FPGA bring-up is not generic "finding bugs earlier" — it is finding specific categories of bugs that only manifest when firmware actually executes against hardware registers. Understanding which bug categories these are clarifies why virtual platform development alone is insufficient.
Register reset-value bugs: many peripherals have specific expected reset values for their configuration registers that drivers read and validate before initialization. When the RTL implementation has a reset value that differs from the specification that the driver was written against, the driver either misidentifies the peripheral version or skips initialization steps it believes are unnecessary. These bugs require actual RTL execution to manifest — they are invisible in QEMU because the peripheral model was written from the same specification document as the driver.
Memory-mapped I/O ordering bugs: the ARM architecture allows memory accesses to be reordered unless explicit barriers are inserted. When firmware writes to a peripheral register and immediately reads a status register to confirm the operation, a missing or misplaced memory barrier in the driver lets the CPU reorder the transactions, so the status read completes before the write has reached the peripheral. This is a well-known class of embedded Linux driver bug that only appears on real hardware or RTL-accurate simulation — QEMU serializes all peripheral accesses.
DMA coherency bugs: when a DMA engine writes to memory and the CPU subsequently reads that memory, the CPU's cache may return a stale value if the cache was not properly invalidated before the DMA transfer. Finding the exact point in the DMA controller RTL that requires the cache flush, and confirming the driver's cache maintenance sequence, requires running the actual DMA transactions against RTL-level hardware.
Interrupt topology mismatches: the interrupt controller configuration — which interrupt line each peripheral connects to, the active level or edge polarity, the priority configuration — must exactly match the device tree and driver configuration. A single GIC register misconfiguration produces symptoms ranging from no interrupt delivery to spurious interrupts that destabilize the system. These bugs are found by running the actual interrupt controller RTL with firmware that exercises interrupt-driven peripheral operation.
RISC-V and the Democratization of Pre-Silicon Bring-Up
Custom silicon development was historically the province of large semiconductor companies that could afford both tape-out costs and the expensive EDA infrastructure required for pre-silicon validation. The emergence of open-source RISC-V SoC design frameworks, open-source FPGA emulation tools, and accessible FPGA hardware is extending pre-silicon firmware bring-up to smaller design teams and to organizations building custom silicon for the first time.
Chipyard, the open-source RISC-V SoC construction framework from UC Berkeley, provides a complete path from RTL composition through FPGA emulation to ASIC tapeout, sharing the same RTL across all three execution environments. A custom SoC described in Chipyard can be simulated in Verilator for RTL verification, mapped to FireSim for FPGA-accelerated full-system simulation with Linux boot, and submitted to an open shuttle program (Google/efabless OpenMPW) for ASIC fabrication — all from the same design source. The Makinote cluster at BSC-CNS validated a RISC-V processor design by running HPC Challenge benchmarks on the emulated RTL running on 32 FPGAs, demonstrating that at-scale pre-silicon validation is achievable without proprietary emulation infrastructure.
The practical implication for embedded product teams building custom SoCs on RISC-V is that FPGA-based pre-silicon firmware bring-up is now accessible at a cost point significantly below the Cadence Palladium/Protium infrastructure tier. A development team with access to several Xilinx Alveo boards or cloud FPGA instances can run FireSim-based pre-silicon bring-up, boot Linux on their custom RTL, and iterate on BSP and driver code before tape-out, using the same open-source tools that academic groups have validated on commercially-taped-out designs.
On ARM-based custom SoCs, Intel/Altera's SoC FPGAs (Cyclone V, Arria 10, Agilex 7) provide an integrated path: the Arm Cortex-A hard processor system (HPS) on these devices executes the actual firmware, while the FPGA fabric implements the custom IP blocks from the design under development. The GSRD (Golden System Reference Design) and associated Yocto BSP tooling provide the reference bring-up environment. This is a less flexible emulation model than a pure-FPGA RTL implementation — the processor cores are the FPGA's hard cores rather than the design's target cores — but it is highly accessible and supports complete Linux boot with Yocto-generated images, JTAG debug via Arm Development Studio, and direct validation of custom peripheral IP.
Quick Overview
Pre-silicon firmware emulation on FPGA allows Linux kernel, bootloader, and device drivers to run against the actual SoC RTL before tape-out — finding register-level bugs, DMA coherency issues, interrupt topology mismatches, and device tree errors while the hardware is still modifiable. Hardware emulators (Cadence Palladium) execute RTL at 1–5 MHz with full signal visibility; FPGA prototype platforms (Cadence Protium, FireSim) execute at 5–50 MHz with sufficient speed for iterative Linux and Android bring-up. Neither replaces QEMU for early application development, and neither replaces first silicon for full-speed validation — they fill the critical gap where RTL-accurate firmware validation at software-development speed is required. The shift-left result: firmware bugs that previously required months-long silicon re-spin cycles to fix are found and corrected while the design is still RTL, before tape-out.
Key Applications
Custom SoC development for automotive, industrial, and broadcasting applications, where silicon tape-out represents a $500K–$5M commitment and firmware readiness on day one of silicon availability is a commercial requirement; RISC-V custom processor designs using Chipyard and FireSim to validate BSP and driver code before shuttle tape-out; embedded Linux product teams building application processors with custom peripheral IP, where driver-hardware interface bugs are the primary post-silicon schedule risk; and any program where the software and hardware teams must develop concurrently rather than sequentially to meet product schedule.
Benefits
Firmware bugs found on FPGA before tape-out cost orders of magnitude less to fix than bugs found after first silicon: the FPGA fix is an RTL change and bitstream recompile; the post-silicon fix is a full re-spin. Pre-silicon FPGA bring-up allows firmware and hardware teams to work in parallel rather than sequentially, eliminating the months-long wait between tape-out and firmware team productivity. At 5–50 MHz on FPGA prototype platforms, Linux boots in minutes and drivers can be iterated at development pace. FPGA emulation accurately exposes register reset value bugs, memory-mapped I/O ordering requirements, and DMA coherency sequences that virtual platforms built from specification documents cannot reveal.
Challenges
Synthesizing production SoC RTL for FPGA requires adapting ASIC-specific memory macros (SRAM, DDR PHY), analog blocks, and high-speed I/O to FPGA-compatible equivalents, which is a non-trivial engineering effort that requires both RTL expertise and FPGA implementation knowledge. Multi-billion-gate SoCs that exceed a single FPGA's capacity require multi-FPGA partitioning with high-bandwidth inter-FPGA interconnects, adding compilation complexity and sometimes requiring manual timing closure work. The 5–50 MHz execution speed is sufficient for Linux bring-up but impractical for performance characterization or for software that has real-time requirements — those activities still require first silicon.
Outlook
Cloud FPGA instances (Amazon EC2 F1, equivalent offerings) are making pre-silicon FPGA bring-up accessible to teams that cannot justify dedicated FPGA prototype hardware, by converting a capital expense into a per-hour operational cost. The open-source FireSim and Chipyard ecosystem is extending this capability to organizations building custom RISC-V silicon that previously had no pre-silicon firmware validation path. Enterprise prototyping systems are evolving to support multi-billion-gate designs — Cadence Protium X3 targets the AMD VP1902 Versal Adaptive SoC, the largest commercially available FPGA — enabling pre-silicon bring-up of the most complex SoC designs. The shift-left principle, already established in hardware verification, is completing its extension to firmware and OS-level software development.
Related Terms
pre-silicon firmware emulation, FPGA prototype, hardware emulation, shift left, RTL, tape-out, silicon bring-up, BSP, board support package, device tree, DTS, U-Boot, SPL, ATF BL2, Linux kernel bring-up, driver probe, DMA coherency, memory-mapped I/O, MMIO ordering, interrupt controller, GIC, reset value, QEMU, virtual platform, Fast Models, instruction-accurate simulation, Cadence Palladium, Cadence Protium, Synopsys HAPS, Synopsys ZeBu, FireSim, Chipyard, RISC-V, Makinote, AMD VP1902 Versal, Xilinx Alveo, Amazon EC2 F1, Yocto, PetaLinux, GSRD, JTAG, Arm Development Studio, multi-FPGA partition, ASIC cell, SoC, clock gating, SRAM macro, DDR PHY stub, behavioral model, hardware-software co-verification
FAQ
What is the difference between hardware emulation and FPGA prototyping for pre-silicon firmware bring-up?
Why can QEMU not replace FPGA bring-up for firmware validation?
What specifically happens during Linux bring-up on an FPGA prototype?
How does FireSim enable pre-silicon Linux bring-up for custom RISC-V SoCs?