Why Traditional BESS Monitoring Leaves Critical Safety Gaps — and How Edge AI Expands Coverage Beyond the BMS

Why BESS Monitoring Focused on the BMS Misses Most Failure Causes

 

When a battery energy storage system fails, the public narrative almost always starts in the same place: thermal runaway, faulty cells, lithium-ion chemistry. It is a narrative that fits the visible drama of a fire but consistently misrepresents where the actual failure originated. EPRI's analysis of its BESS Failure Incident Database — the most comprehensive public record of stationary storage failures — found that only 11 percent of failures trace to cell defects. The rest involve integration, assembly, operational factors, and balance-of-system components: HVAC, wiring, fire suppression, cooling systems, controls. Not the cells.

This distinction matters because the monitoring architecture deployed on the majority of BESS assets today is designed almost entirely around the cell layer. The BMS watches voltage, current, and temperature at cell and module level. Cloud analytics model degradation curves from that same data. The assumption embedded in this architecture is that if the cells are within spec, the system is safe. EPRI's data says otherwise. And the operational consequence is that most BESS sites are running with significant blind spots in the systems that actually drive failures — HVAC health, container environment, humidity, condensation risk, and off-gas detection.

Your BMS Can’t See the Failures That Matter Most

Traditional BMS data cannot see HVAC degradation, humidity spikes, condensation risk, or early off-gas patterns. Promwad’s edge AI platform adds continuous monitoring for the battery-adjacent systems that standard BESS monitoring misses — without modifying the BMS, SCADA, or OT network.

 

What the BMS Can and Cannot See

A battery management system is purpose-built for one job: protecting the electrochemical integrity of the cell stack. It measures cell voltages, balances charge across modules, monitors string temperatures, tracks state of charge and state of health, and enforces protection limits through contactor control. For this job, a modern BMS is highly capable. It is the right tool for managing the cell layer.

The problem is what sits outside that layer. A BESS container is a complex electromechanical environment with multiple interdependent systems, each capable of contributing to failure independently of cell condition. The BMS has no inputs from any of them. It does not measure HVAC compressor efficiency, refrigerant charge level, or airflow distribution across racks. It does not track humidity or dew point inside the container enclosure. It does not correlate gas sensor outputs against environmental baselines to distinguish genuine off-gas events from HVAC cycling artifacts. And it cannot detect the slow, months-long decline in cooling performance that silently accelerates battery aging before triggering any cell-level alarm.

The monitoring gap is not a design flaw in the BMS; it is a scope boundary. A BMS was never intended to monitor balance-of-system health. The problem is that O&M programs and monitoring architectures frequently behave as though it were.

The practical effect becomes visible in the failure data. EPRI's 2024 analysis found that in the last three years of documented incidents, every failure that could be categorized by failed element involved either controls or balance-of-system components — including HVAC, liquid cooling systems, and enclosure infrastructure. None traced to cells or modules in that period. The assumption that comprehensive cell monitoring equals comprehensive site safety monitoring is directly contradicted by the incident record.

The Pre-Runaway Signal Chain Most Sites Never See

Thermal runaway is typically described as the event to be prevented. In practice, it is the end point of a degradation chain that begins much earlier, in systems that standard monitoring does not cover. Understanding that chain helps clarify why earlier-stage monitoring has operational value that BMS threshold alarms cannot provide.

A representative pre-runaway progression in an HVAC-related failure follows this sequence:

  • Compressor performance degrades gradually over months due to refrigerant loss, bearing wear, or fouling. Coefficient of Performance drops from the commissioning baseline but stays above alarm thresholds.
  • Cooling capacity declines. Cell temperatures begin running slightly higher than optimal — within BMS limits, but above the optimal operating range for cycle life.
  • Battery degradation accelerates. The Arrhenius relationship between temperature and degradation rate means that sustained operation 10°C above optimal roughly doubles the aging rate. Asset life shortens without a single alarm being generated.
  • A stress event — a hot ambient day, a high-charge cycle, a temporary cooling failure — pushes temperatures to the threshold where cell chemistry becomes unstable.
  • Off-gas generation begins. VOC and hydrogen concentrations rise inside the container. Gas sensors, if present, register the change — but most sites have binary threshold alarms that produce frequent false positives from HVAC cycling, creating alarm fatigue that causes operators to discount the signal.
  • Without a high-confidence early warning, operators do not act effectively on the 5-to-20-minute window between detectable off-gas and thermal runaway onset.
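The temperature sensitivity in the degradation step above can be put into rough numbers. The sketch below is a rule-of-thumb approximation rather than a cell-specific aging model: it assumes only what the text states, that the aging rate roughly doubles for every 10°C above the optimal temperature. The function name and the `doubling_step_c` parameter are illustrative.

```python
def aging_acceleration(delta_t_c: float, doubling_step_c: float = 10.0) -> float:
    """Approximate aging-rate multiplier for operation delta_t_c above optimal.

    Assumes the rule of thumb from the text: the rate doubles per 10 degC.
    Real cells follow chemistry-specific Arrhenius parameters.
    """
    return 2.0 ** (delta_t_c / doubling_step_c)


# A container running 5 degC warm because of degraded cooling:
factor = aging_acceleration(5.0)
print(f"aging rate x{factor:.2f}")
```

Under this assumption, even a 5°C offset that never approaches a BMS limit corresponds to roughly 1.4 times the baseline aging rate.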

Each step in this chain is detectable before the next one occurs. None of them triggers a BMS alarm until the chain is nearly complete. The monitoring architecture that could catch the progression exists in concept (continuous HVAC health scoring, COP trending, humidity and condensation monitoring, pattern-based gas analytics), but it is not installed on most sites. The standard model is periodic inspections, threshold alarms, and BMS data. That combination does not close the detection gap.
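As an illustration of what COP trending could look like, the sketch below compares a recent average COP against the commissioning baseline and maps the decline to a maintenance status. It is a minimal sketch under stated assumptions: the function names, the 30-sample window, and the 10 and 20 percent thresholds are hypothetical, and it presumes that cooling output and electrical input are already being measured.

```python
from statistics import mean


def cop(cooling_kw: float, electrical_kw: float) -> float:
    """Coefficient of Performance: heat removed per unit of electrical input."""
    if electrical_kw <= 0:
        raise ValueError("electrical_kw must be positive")
    return cooling_kw / electrical_kw


def cop_decline(daily_cop: list[float], baseline_cop: float, window: int = 30) -> float:
    """Fractional drop of the recent average COP versus the commissioning baseline."""
    recent = daily_cop[-window:]
    if not recent:
        raise ValueError("daily_cop must not be empty")
    return 1.0 - mean(recent) / baseline_cop


def hvac_status(decline: float, watch: float = 0.10, service: float = 0.20) -> str:
    """Map a COP decline to a maintenance status (thresholds are hypothetical)."""
    if decline >= service:
        return "service"
    if decline >= watch:
        return "watch"
    return "ok"


# Baseline COP of 3.5 at commissioning, recent month averaging 3.0:
history = [3.5] * 30 + [3.0] * 30
print(hvac_status(cop_decline(history, baseline_cop=3.5)))
```

The point of trending against a baseline rather than a fixed alarm limit is that a unit can lose a meaningful share of its efficiency while every individual reading still looks acceptable.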

Why Humidity and Condensation Deserve More Attention Than They Get

Water ingress and condensation are documented causes of BESS fires, but they receive less systematic attention than thermal events because they are harder to attribute clearly after the fact. The causal pathway is typically indirect: humidity rises inside the container, condensation forms on electrical components or busbars during HVAC cycling, insulation resistance declines, and a fault develops that initiates an electrical event.

The Korean ESS fire investigations from 2017 to 2019 — the largest concentrated cluster of BESS incidents in the EPRI database — identified condensation combined with dust contamination as a contributing factor in insulation breakdown. DNV's subsequent analysis confirmed condensation from faulty humidity control as a direct fire cause. The Moss Landing incident in California involved water ingress as a contributing factor. These are not edge cases; they are documented pathways that appear in independent post-incident analyses across multiple jurisdictions.

The engineering reality inside a BESS container makes humidity management challenging. HVAC cycling creates temperature differentials that produce humidity spikes. Seasonal transitions introduce dew point conditions that can generate condensation on surfaces that are far below ambient temperature due to the thermal mass of the battery modules. Standard BMS hardware has no humidity sensors. Periodic inspections catch gross moisture damage but cannot detect the HVAC cycling humidity spikes that occur dozens of times per day.

The key metrics for continuous environmental monitoring in a BESS container cover three areas:

  • Relative humidity at multiple heights inside the container, since humidity stratifies
  • Dew point relative to the surface temperature of critical electrical components
  • Condensation probability as a derived metric that accounts for both humidity and thermal gradient, providing an actionable risk indicator rather than raw sensor values

None of these require BMS integration. They require dedicated sensors and a local processing layer capable of running the condensation probability calculation continuously.
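A condensation indicator of this kind can be sketched in a few lines. The example below computes dew point with the widely used Magnus approximation and then turns the margin between a component's surface temperature and the dew point into a 0-to-1 risk score. The logistic scoring, its `midpoint_c` and `scale_c` parameters, and the function names are assumptions made for illustration; this is a heuristic score, not a calibrated probability.

```python
import math

# Magnus coefficients for water vapour over liquid water (a common choice).
A, B = 17.62, 243.12  # B in degC


def dew_point_c(air_temp_c: float, rh_percent: float) -> float:
    """Dew point from air temperature and relative humidity (Magnus approximation)."""
    gamma = math.log(rh_percent / 100.0) + A * air_temp_c / (B + air_temp_c)
    return B * gamma / (A - gamma)


def condensation_risk(surface_temp_c: float, air_temp_c: float, rh_percent: float,
                      midpoint_c: float = 2.0, scale_c: float = 1.0) -> float:
    """Heuristic 0..1 risk score from the surface-to-dew-point margin.

    Not a calibrated probability: midpoint_c and scale_c are hypothetical.
    """
    margin = surface_temp_c - dew_point_c(air_temp_c, rh_percent)
    return 1.0 / (1.0 + math.exp((margin - midpoint_c) / scale_c))


# A busbar at 15 degC in 25 degC air at 60 % relative humidity:
print(round(dew_point_c(25.0, 60.0), 1))
print(condensation_risk(15.0, 25.0, 60.0) > 0.9)
```

In this example the dew point is about 16.7°C, so a 15°C surface sits below it and the score approaches 1. The same air around a 30°C surface scores near 0.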

The Off-Gas Window and the Alarm Fatigue Problem

Off-gas detection is the only known method that provides advance warning before thermal runaway becomes unstoppable. The detectable gas species (VOCs, hydrogen, CO, and CO₂) begin appearing 5 to 20 minutes before the runaway event, giving operators a window to initiate emergency response. That window is the most valuable safety margin available in a BESS environment where fire suppression systems are designed to contain a fire, not prevent one.

The reason this window is frequently wasted is not sensor absence but analytics absence. Most BESS sites install gas detectors as required by NFPA 855 and local fire codes. Those detectors output binary threshold alarms. A threshold alarm treats any reading above a fixed concentration level as an event — it does not know whether the reading is caused by genuine cell off-gassing, a cleaning product used during maintenance, sensor drift, or the humidity and VOC fluctuations that accompany HVAC cycling. The result is a high false-positive rate that trains operators to treat alarms as noise.

The difference between a binary threshold alarm and a pattern-based analytics layer is the difference between an alarm that operators have learned to ignore and an alert they can act on with confidence. Pattern recognition across multiple gas species — correlating VOC, H₂, CO, and CO₂ signals simultaneously — can distinguish genuine thermal runaway precursors from environmental interference. A multi-gas correlation that shows simultaneous VOC and H₂ elevation with a characteristic time signature has a fundamentally different risk profile than a single-sensor threshold crossing during an HVAC cycle.
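The contrast with a binary threshold can be made concrete with a small sketch. The code below is a deliberately simplified rule, not a trained model and not any vendor's algorithm: it flags a suspected precursor only when VOC and H₂ both show a sustained rise over the same window, and downgrades a single-species rise to a lower-severity anomaly. The window lengths, the 1.5x ratio, and all names are hypothetical.

```python
from statistics import mean


def sustained_rise(samples: list[float], baseline_n: int = 5,
                   recent_n: int = 5, min_ratio: float = 1.5) -> bool:
    """True if the mean of the newest samples is at least min_ratio times
    the mean of the oldest samples in the window."""
    if len(samples) < baseline_n + recent_n:
        return False
    baseline = mean(samples[:baseline_n])
    recent = mean(samples[-recent_n:])
    return baseline > 0 and recent >= min_ratio * baseline


def classify(window: dict[str, list[float]]) -> str:
    """Classify one time window of per-gas readings (keys such as 'voc', 'h2')."""
    rising = {gas for gas, samples in window.items() if sustained_rise(samples)}
    if {"voc", "h2"} <= rising:
        return "precursor_suspected"   # correlated rise in both species
    if rising:
        return "single_species_anomaly"  # e.g. cleaning product or HVAC cycle
    return "normal"


flat = [10.0] * 10
rising = [10.0] * 5 + [20.0] * 5
print(classify({"voc": rising, "h2": rising, "co": flat}))
print(classify({"voc": rising, "h2": flat, "co": flat}))
```

A production analytics layer would also need per-site baselines and the time-signature analysis described above. The sketch only shows why a correlated multi-species condition produces fewer spurious alerts than any single threshold.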
 


Where Edge Processing Fits — and Why Cloud Dependency Is the Wrong Architecture for Safety

The case for edge AI in BESS monitoring is not about AI as a general principle. It is about the specific requirements of safety-critical decision-making in an industrial environment where communication links cannot be assumed to be available, and where the latency of a cloud round-trip is incompatible with sub-second response requirements.

A BESS site may be in a location with limited connectivity. The OT network that connects BMS and SCADA systems is typically isolated from external networks for cybersecurity reasons. An architecture that routes safety-relevant sensor data to a cloud platform, runs inference, and returns an alert introduces multiple failure points: network availability, cloud platform availability, and latency that could consume a meaningful fraction of the available 5-to-20-minute off-gas window.

On-site edge processing eliminates these dependencies. The inference runs locally, the alert is generated locally, and the response decision is made at the asset level regardless of connectivity state. For the specific case of thermal runaway early warning, this is not an architectural preference — it is a functional requirement.
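The local-first pattern described here can be sketched as follows. This is an illustrative structure only: the class name, the `uplink` callable, and the queue size are assumptions, and a real unit would drive physical outputs rather than append to a list. What it shows is the ordering: the on-site alarm is raised first, and forwarding to any remote system is best-effort with store-and-forward.

```python
from collections import deque


class EdgeAlerter:
    """Raises alarms locally first; forwards them upstream when a link exists."""

    def __init__(self, uplink, max_queue: int = 1000):
        self.uplink = uplink                     # callable; may raise ConnectionError
        self.pending = deque(maxlen=max_queue)   # store-and-forward buffer
        self.local_alarms = []                   # stand-in for on-site alarm outputs

    def handle(self, event: dict) -> None:
        self.local_alarms.append(event)  # 1. act on site, no network involved
        self.pending.append(event)       # 2. queue for remote reporting
        self.flush()

    def flush(self) -> None:
        """Send queued events in order; stop quietly if the link is down."""
        while self.pending:
            try:
                self.uplink(self.pending[0])
            except ConnectionError:
                return
            self.pending.popleft()


def link_down(event):
    raise ConnectionError("no uplink")


node = EdgeAlerter(uplink=link_down)
node.handle({"type": "off_gas", "severity": "high"})
print(len(node.local_alarms), len(node.pending))
```

With the link down, the alarm is still raised on site and the event waits in the queue. Calling `flush()` after connectivity returns delivers it without affecting the local response.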

The deployment argument for edge AI is also different from that of traditional monitoring upgrades. A conventional approach to expanding BESS monitoring beyond the BMS typically involves modifying the BMS integration, adding sensors to SCADA, engaging the OEM for firmware changes, and going through IT approval processes for new network endpoints. For operators managing multi-vendor fleets across multiple sites, this is a multi-month engagement per site, with outcome uncertainty at each step.

An OEM-agnostic edge platform deploys dedicated sensors and a local edge computing unit without touching the BMS, SCADA, or OT network. No BMS integration is required. No SCADA modification is required. No IT approvals are triggered by changes to the OT architecture. The platform monitors the adjacent systems — HVAC, environment, gas — independently, and can be operational at a new site in 90 days. For operators who want to validate the approach on a single asset before committing to a fleet deployment, this pilot model is a practical on-ramp that the traditional integration model does not offer.

The monitoring gap that EPRI's failure data describes is real, documented, and addressable with technology that exists today. The reason it persists is not technical — it is architectural and operational. Sites are designed around the assumption that BMS coverage is sufficient coverage, and the infrastructure to challenge that assumption has not been deployed at scale. The incident record suggests it should be.

Quick Overview

Most BESS safety monitoring is concentrated at the cell layer, where the BMS tracks voltage, current, and temperature. EPRI's 2024 analysis of documented failures shows that this coverage misses the systems responsible for the majority of real-world incidents: HVAC and cooling infrastructure, container environmental conditions, and balance-of-system integration. Edge AI monitoring platforms address this gap by deploying dedicated sensors for HVAC health, humidity, off-gas detection, and thermal gradients, with on-site processing that eliminates cloud dependency for safety-critical decisions.

Key Applications

  • Utility-scale and C&I BESS assets requiring monitoring coverage beyond BMS cell data
  • Multi-vendor BESS fleets where OEM-agnostic deployment is needed without BMS or SCADA integration
  • Sites with documented HVAC degradation or high ambient humidity exposure
  • Operators seeking NFPA 855-compliant off-gas analytics with reduced false-positive rates
  • New BESS projects in the first two years of operation, where failure rates are documented to be highest

Benefits

Continuous HVAC health scoring and COP trending detect cooling degradation months before it affects cell temperatures or triggers BMS alarms. Pattern-based multi-gas analytics reduces alarm fatigue by distinguishing genuine off-gas events from environmental interference. Real-time condensation probability monitoring addresses a documented fire cause that standard BMS hardware cannot see. Deployment without BMS integration or SCADA modification removes the primary barrier to rapid site-by-site rollout.

Challenges

Adding sensor infrastructure to existing BESS containers requires physical access and installation coordination with site operators. Edge platform baseline learning requires a commissioning period to establish site-specific normal operating patterns before anomaly detection becomes reliable. Integrating edge alert outputs into existing O&M workflows and escalation procedures requires organizational change alongside the technical deployment.

Outlook

BESS deployment continues to scale globally, driven by grid-scale renewable integration and capacity markets. As fleet sizes grow, the operational and insurance cost of preventable failures grows proportionally. Regulatory and insurance frameworks — including evolving NFPA 855 requirements and insurer demands for documented hazard mitigation — are creating pressure to expand monitoring coverage beyond BMS data. Edge AI platforms that deploy without OT network modification are positioned to become a standard layer of BESS O&M infrastructure alongside existing BMS and SCADA systems.

Related Terms

BESS, BMS, thermal runaway, off-gas detection, HVAC monitoring, COP tracking, condensation risk, edge AI, EPRI failure database, NFPA 855, balance of system, predictive maintenance, VOC detection, hydrogen detection, IEC 62933, DNV BESS safety, compressor health monitoring, OT network, on-site inference

 

 

FAQ

What percentage of BESS failures are caused by battery cells rather than other system components?

EPRI's analysis of its BESS Failure Incident Database found that only 11 percent of failures traced to cell defects. The majority involve balance-of-system components including HVAC, cooling systems, electrical wiring, and fire suppression infrastructure, as well as integration, assembly, and operational factors. This finding directly contradicts the common assumption that cell-level monitoring provides adequate safety coverage for a BESS asset.
 

Why is BMS data insufficient for comprehensive BESS safety monitoring?

A BMS monitors cell voltage, current, temperature, and state of charge — parameters directly related to electrochemical condition. It has no sensors for HVAC performance, container humidity, dew point, condensation risk, or off-gas concentrations. Since the majority of documented BESS failures originate in balance-of-system components and environmental conditions rather than cells, a monitoring architecture limited to BMS data systematically misses the failure modes responsible for most incidents.
 

How does edge AI improve off-gas detection in battery energy storage systems?

Standard gas detectors in BESS containers produce binary threshold alarms that generate high false-positive rates from HVAC cycling, environmental variation, and sensor drift. Edge AI applies pattern recognition across multiple gas species simultaneously — correlating VOC, hydrogen, CO, and CO₂ signals — to distinguish genuine thermal runaway precursors from environmental interference. This produces higher-confidence alerts with fewer false positives, making the 5-to-20-minute pre-runaway warning window actionable rather than lost to alarm fatigue.
 

What is the role of condensation monitoring in BESS fire prevention?

Condensation on electrical components is a documented cause of insulation breakdown and subsequent fire in BESS installations, confirmed in Korean ESS fire investigations and DNV post-incident analysis. HVAC cycling creates temperature differentials that produce humidity spikes and condensation on surfaces cooled by the battery thermal mass. Continuous monitoring of humidity, dew point, and condensation probability on critical surfaces detects this mechanism in real time — a capability absent from standard BMS hardware.