Proper acceptance testing, preventive maintenance, and a recovery plan go a long way toward eliminating facility failures. Below are a few painful lessons learned from the field.
One owner tested generators with 100% block loading every week, which caused insulation breakdown in the alternators. The damage to multiple alternators led to two generator failures and prompted testing of every generator to identify the issue. While generator testing is necessary, owners must ensure it is performed according to the generator manufacturer’s guidelines and best practices to avoid damage to equipment.
While an owner was performing MV breaker maintenance in their data center, the critical load was transferred to system bypass. Unfortunately, the 4,000 amp bypass breaker included a new trip unit that was improperly installed. The new trip unit had been checked with secondary injection testing, which did not reveal that the breaker had reverted to a 1,600 amp trip setting. The actual load was 2,850 amps, so the bypass breaker tripped on overload. Primary injection testing would have identified the problem and allowed it to be corrected before a trip occurred.
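The arithmetic behind this trip can be sketched in a few lines. The example below is a simplified illustration, not a model of a real trip unit: it treats the trip setting as a single threshold and ignores time-current curves. The function name `trips_on_overload` is hypothetical, and the 4,000 amp intended setting is an assumption based on the breaker rating given above.

```python
def trips_on_overload(load_amps: float, trip_setting_amps: float) -> bool:
    """Return True when sustained load exceeds the trip setting.

    Simplification: a real trip unit follows a time-current curve,
    not a single threshold.
    """
    return load_amps > trip_setting_amps


INTENDED_SETTING_AMPS = 4000  # assumed equal to the breaker rating
REVERTED_SETTING_AMPS = 1600  # setting the trip unit reverted to
ACTUAL_LOAD_AMPS = 2850       # load carried on bypass

# At the intended setting, the load sits comfortably below the threshold.
print(trips_on_overload(ACTUAL_LOAD_AMPS, INTENDED_SETTING_AMPS))

# At the reverted setting, the same load is well above the threshold.
print(trips_on_overload(ACTUAL_LOAD_AMPS, REVERTED_SETTING_AMPS))

# Ratio of actual load to the reverted setting.
print(ACTUAL_LOAD_AMPS / REVERTED_SETTING_AMPS)
```

The load was roughly 1.8 times the reverted setting, which is why only a test that verifies the breaker's behavior under real current, rather than the trip unit in isolation, would have exposed the problem.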
Major Airline Delays Minor Maintenance
A major airline opted to employ the “run to failure” mode of maintenance, investing little to no money in electrical equipment maintenance. A small fire in a maintenance panel led to a major power outage that grounded flights worldwide for 24 hours. Additional flights were canceled or delayed for three more days. The incident received widespread press and affected the airline’s reputation and stock price in the short run. Was the money saved from the lack of maintenance worth the resulting costs?
Airline Service Rolls the Dice
An airline facility was constructed in the early 1990s. Because the facility had never had a single utility power disruption, maintenance personnel could not convince management to spend the $450,000 needed to replace obsolete UPS modules. When the modules were finally replaced, staff discovered there were not enough batteries left in the string to support an outage. While the facility never experienced a failure, the delay in replacing the UPS modules and batteries was a disaster waiting to happen. According to Data Center Dynamics,
“Batteries may be the most ‘low-tech’ components supporting today’s mission-critical facilities, but battery-related failures account for more than one-third of all UPS system failures over the life of the equipment. The continuity of critical systems during a power outage typically is dependent on a data center’s power equipment, comprised of UPSs and their respective battery backups. While the vast majority of outages last less than ten seconds, a single bad cell can cripple a data center’s entire backup system.”
TV Service Provider Rolls Back Budget
This TV service provider’s electrical equipment was outdated and needed upgrading, but upper management did not understand the importance of the upgrade and delayed the replacement project. After the company was purchased by another service provider, the electrical replacement budget got lost in the shuffle. A UPS battery failure then dropped the critical load, causing a serious service disruption. Equipment failures don’t wait for decisions to be made.
Acceptance testing was performed for this internet service provider. Current transformers (CTs) and the overcurrent relays were tested, but the interconnect wiring was not tested via current injection. Unfortunately, the differential circuits were wired improperly, resulting in start-up delays because generators would not stay connected to the bus. If a component or system is critical, all aspects of testing need to be performed during startup and maintenance.
IT added equipment in the data center but failed to notify the Facilities department. As a result, the Facilities department did not plan an outage to verify proper A-B cording of the new equipment. When the UPS system failed, an investigation revealed that the new dual-corded loads were connected to the same power source. Remember, a quick scheduled outage is always preferable to an unplanned outage.
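A records check can flag this kind of miscording before a failure exposes it. The sketch below assumes a hypothetical inventory that maps each dual-corded device to the power source feeding each cord; the device names, source labels, and the `find_miscorded` function are illustrative only. It audits documentation and does not replace the physical verification described above.

```python
from typing import Dict, List, Tuple


def find_miscorded(cords: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return devices whose two cords land on the same power source."""
    return sorted(
        name for name, (cord_1, cord_2) in cords.items() if cord_1 == cord_2
    )


# Hypothetical inventory: device name -> (source for cord 1, source for cord 2)
inventory = {
    "rack-01-server": ("A", "B"),
    "rack-02-switch": ("A", "A"),
    "rack-03-storage": ("B", "B"),
}

# Devices listed here would lose both cords if their single source failed.
print(find_miscorded(inventory))
```

Such a check is only as good as the inventory behind it, which is why IT and Facilities need a shared change process for new equipment.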
Due to budget and time constraints, the mechanical commissioning period was shortened and did not reveal issues with the system timer settings. In operation, the air-cooled backup chillers shut down 45 minutes after transfer, causing the IT equipment to overheat. In commissioning, money saved in the short run is spent tenfold in the long run.
Final Thoughts: Maintenance and Planning are Key
Testing and maintenance are critical in a data center. However, the maintenance budget is usually the last approved and the first cut. Maintenance and operations personnel are often undertrained and afraid to operate their system. The gold standard of data center testing is the “pull the plug” test, where utility power is turned off to determine whether the system operates as intended. Most facilities would never dream of performing this test because they do not have confidence in themselves or their system. Owners also fail to perform a root cause analysis when failures occur to determine lessons learned and identify the changes needed to prevent a recurrence. Ideally, a recovery plan should be in place before any failure occurs. It should cover limiting the damage, assessing the damage, prioritizing the corrective action, repairing or replacing what is needed, and determining any changes that need to occur going forward.