For data centers, preventive maintenance and assessments are key to the disaster planning process.
When we think of the disaster planning process, one key area that continues to deserve attention is data center availability. With today’s heavy reliance on technology and automated systems, disruptions to electrical and cooling systems in the data center can have severe impacts on the business.
Ensuring availability of electrical and cooling infrastructure is crucial because data centers are dynamic environments. Heat loads are constantly increasing, challenging cooling strategies; computer equipment is often moved or changed without thinking about the underlying support strategy; technology shifts and blade servers are creating extreme heat densities and hot spots; and abandoned cabling has reduced adequate airflow for cooling systems.
If a well-orchestrated assessment and maintenance strategy is implemented, the business effects of a disaster can be minimal. Business continuity depends on the data center manager’s awareness of potential disasters, their ability to develop a plan to minimize disruptions of critical functions, and the capability to recover operations quickly and successfully. A well-planned process should minimize the disruption of operations and ensure some level of organizational stability and an orderly recovery after a disaster.
Two key areas that should be part of any data center disaster planning strategy are assessments and comprehensive preventive maintenance.
Data Center Assessments
Data center assessments are designed to expose vulnerabilities in your underlying availability strategy posed by problems within the cooling or electrical systems. Assessments analyze the cooling processes for sensitive, heat-generating computer equipment and identify hot spots and other risks to the data center. They also evaluate electrical system capacity and the quality of power provided in the data center, identifying electrical risks that could create downtime.
A comprehensive assessment allows data center managers to:
- Identify unwanted hot spots to avoid degradation of mission-critical equipment and critical data
- Pinpoint and resolve power problems in the data center, including electrical spikes, sags, surges, harmonic
distortion, and more
- Ensure proper electrical system capacity and power quality supporting the data center
- Maximize efficient use of electrical systems, cooling, and floor space
- Uncover infrastructure design issues through use of failure scenarios
and single-point-of-failure analysis
- Reduce operating costs of the data center by employing recommended strategies
An experienced technician with industry-leading assessment tools will identify the gaps in infrastructure availability and make recommendations for improvement. A data center assessment typically includes onsite cooling and electrical assessments, documentation of findings and recommendations, floor plans and equipment lists, computational fluid dynamics (CFD) modeling of current state and recommended solutions, assessment of potential for expansion, and an in-person review of report findings.
Over time, the changes in a data center can create risks to business-critical continuity. Sometimes, even new facilities experience nagging power or cooling problems that create risks for continuity or interfere with facility performance.
An electrical assessment evaluates electrical infrastructure availability, capacity, and power quality for the data center. An electrical assessment will provide the data center manager with information on single points-of-failure and any potential power issues such as harmonic distortion, voltage regulation, and load imbalance. Most importantly, electrical assessments provide recommendations to maximize system availability both now and in the future.
A typical electrical assessment includes:
- Performing a single point-of-failure analysis, which will
identify critical failure points in the system
- Determining capacity and appropriate settings of all
switchgear from the main to mission-critical power
distribution units (PDUs)
- Performing analysis comparing measured current and
power rating for all breakers from the main to
mission-critical PDUs as well as any imbalances, noting any areas
- Determining kilowatt and kilovolt-ampere loading on each
Uninterruptible Power Supply (UPS), compared to the UPS rating
- Evaluating the rated capacity of each generator versus UPS rated
capacity and noting if generator full-load rating is less than the
recommended 150% of UPS rating
- Performing a harmonic snapshot at the main breaker switchgear as
well as the load side of each UPS and identifying any anomalies
- Determining whether breakers are properly labeled
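Several of the checks above reduce to simple arithmetic on measured values. The following is a minimal sketch, not a vendor tool: the record layout, field names, and sample numbers are hypothetical. The 150% generator guideline comes from the list above. The imbalance formula (largest deviation from the mean phase current, as a percent of the mean) is one common convention and is assumed here, since no method is specified.

```python
from dataclasses import dataclass

# From the checklist above: generator full-load rating should be at
# least 150% of the UPS rating.
GENERATOR_TO_UPS_RATIO = 1.5


@dataclass
class UpsReading:
    """Hypothetical record for one UPS; field names are illustrative."""
    name: str
    rated_kva: float
    rated_kw: float
    measured_kva: float
    measured_kw: float


def ups_loading_percent(ups):
    """Return (kVA loading %, kW loading %) relative to the UPS rating."""
    return (100.0 * ups.measured_kva / ups.rated_kva,
            100.0 * ups.measured_kw / ups.rated_kw)


def generator_is_adequate(generator_kva, ups_rated_kva):
    """True when the generator rating meets the 150% guideline."""
    return generator_kva >= GENERATOR_TO_UPS_RATIO * ups_rated_kva


def phase_imbalance_percent(phase_amps):
    """Largest deviation from the mean phase current, as a percent of the mean.

    Assumption: this max-deviation-from-average formula is only one way
    to express imbalance.
    """
    mean = sum(phase_amps) / len(phase_amps)
    return 100.0 * max(abs(a - mean) for a in phase_amps) / mean


if __name__ == "__main__":
    ups = UpsReading("UPS-A", rated_kva=500, rated_kw=450,
                     measured_kva=320, measured_kw=300)
    kva_pct, kw_pct = ups_loading_percent(ups)
    print(f"{ups.name}: {kva_pct:.0f}% kVA, {kw_pct:.1f}% kW")
    print("Generator adequate:", generator_is_adequate(600, ups.rated_kva))
    print(f"Imbalance: {phase_imbalance_percent([100, 110, 90]):.1f}%")
```

In this example a 600 kVA generator behind a 500 kVA UPS would be flagged, because the guideline calls for at least 750 kVA.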
A cooling assessment evaluates the data center’s performance as it relates to proper equipment heat rejection. It generally consists of a floor plan of the facility; a report showing airflow characteristics and computer room air conditioning (CRAC) unit performance; and recommendations for improvement when it comes to eliminating hot spots, improving air flow, and operating efficiently.
A typical cooling assessment includes:
- Taking temperature readings at critical points throughout the data center
- Identifying hot spots and recommendations for how to eliminate them
- Taking air flow measurements—identifying raised-floor air patterns, under-floor obstructions, and air flow
through computer racks
- Comparing equipment load with system cooling capacity
- Providing a floorplan showing the location of existing equipment, server racks, and air flow obstructions
- Performing CFD modeling showing the temperature and air flow characteristics of the space
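Comparing equipment load with cooling capacity, and flagging hot spots from temperature readings, can be sketched in a few lines. This is an illustration under stated assumptions: it counts only IT equipment load (not lighting, people, or building heat gain), the rack names and readings are made up, and the 27 °C limit is a placeholder to be replaced with the limit your equipment vendor specifies. The unit conversions (1 kW is about 3,412 BTU/hr; 1 ton of refrigeration is 12,000 BTU/hr) are standard.

```python
BTU_PER_HR_PER_KW = 3412.14    # heat rejected by 1 kW of electrical load
BTU_PER_HR_PER_TON = 12000.0   # definition of one ton of refrigeration


def heat_load_tons(it_load_kw):
    """Convert an IT electrical load in kW to a cooling load in tons."""
    return it_load_kw * BTU_PER_HR_PER_KW / BTU_PER_HR_PER_TON


def cooling_headroom_tons(it_load_kw, cooling_capacity_tons):
    """Positive result: spare capacity. Negative result: shortfall."""
    return cooling_capacity_tons - heat_load_tons(it_load_kw)


def find_hot_spots(readings_c, limit_c=27.0):
    """Return locations whose reading exceeds the limit, sorted by name.

    limit_c is a placeholder threshold, not a value from the article.
    """
    return sorted(loc for loc, temp in readings_c.items() if temp > limit_c)


if __name__ == "__main__":
    print(f"Heat load: {heat_load_tons(100):.1f} tons")
    print(f"Headroom: {cooling_headroom_tons(100, 30):.1f} tons")
    print(find_hot_spots({"rack-A1": 24.0, "rack-B3": 31.5, "rack-C2": 27.0}))
```

A capacity comparison like this says nothing about airflow distribution, which is why the assessment also includes airflow measurements and CFD modeling.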
Preventive Maintenance Program
A preventive maintenance (PM) program is also critical to the disaster planning process. One way end-users can minimize infrastructure-related failures is to implement a PM program supported by technicians trained and certified by the original equipment manufacturer (OEM).
The Case for Maintenance
Preventive maintenance is a proactive method to protect business continuity. As organizations become increasingly dependent on data center systems, there is a need for greater reliability in the critical power system. For many organizations, the information technology (IT) infrastructure has evolved into an interdependent business-critical network that includes data, applications, storage, servers, and networking. A power failure at any point along the network can impact the entire operation—and have serious consequences for the business.
Relying on minimal or no preventive maintenance greatly increases the chance that business operations will be disrupted if the power equipment fails, thus exposing a business to revenue loss, reducing productivity, and affecting customer satisfaction and loyalty. Additionally, unexpected costs will be incurred for repairs and replacements. Taking a preventive approach can greatly reduce the probability of large, unexpected capital expenses required to repair or replace important components and/or equipment not properly maintained.
When equipment is not maintained, especially in adverse conditions such as dirty environments or high temperatures, the result can be system deterioration up to and including load loss. Maintenance programs maximize the reliability and performance of the electrical systems on which organizations depend to keep critical systems running.
When correctly implemented, PM programs ensure maximum reliability of data center equipment by providing systematic inspections and detection and correction of incipient failures, either before they occur, or before they develop into major defects that could translate into costly downtime. Typical maintenance programs include inspections, electrical tests and measurements, adjustments, implementation of field updates, and evaluation of housekeeping practices.
Preventive maintenance has a number of benefits for the end-user. First, better reliability is delivered by early detection and correction of routine wear-and-tear issues. Other benefits include extending the product life and minimizing capital expenditures for the equipment. In addition, routine maintenance provided at a fixed cost aids in budget planning and avoids unexpected expenses.
Begin With the UPS
To keep running through outages, electrical spikes, and other power problems, critical systems are dependent on the reliability of the UPS system. Therefore, keeping these systems in working condition is crucial.
While the UPS systems are designed to offer the utmost reliability and performance at an affordable price, they are not failure proof. Factors such as application, installation, design, real-world operating conditions, and maintenance practices can impact the reliability and performance of the UPS systems.
A system is often only as reliable as its weakest component. Some manufacturers address this by reducing the unit parts count, which lowers the chance of a failure. Failures still occur, however, so proactive maintenance that detects problems early can greatly reduce the chance of downtime.
The frequency of PM visits depends on the type of UPS deployed in the organization. Small UPS devices, single-phase units, and three-phase units rated less than 12 kVA should be inspected annually to ensure alarms, filtering, and internal batteries are all operating within specifications. For medium and large UPS devices, three-phase units rated more than 12 kVA, which most likely include ancillary equipment, it’s recommended that inspection and maintenance take place at least twice a year to ensure proper performance and confirmation that the system is operating within the manufacturer’s specifications.
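The visit-frequency guideline above can be written as a small lookup. This is a sketch only: the function name is hypothetical, and because the text covers units rated less than and more than 12 kVA but not exactly 12 kVA, the code assumes the more conservative schedule at the boundary.

```python
SMALL_UPS_LIMIT_KVA = 12.0  # threshold stated in the text above


def recommended_pm_visits_per_year(rated_kva):
    """Minimum PM visits per year for a UPS of the given rating.

    Below 12 kVA: one visit (annual inspection).
    Above 12 kVA: at least two visits.
    Assumption: a unit rated exactly 12 kVA gets the two-visit schedule.
    """
    return 1 if rated_kva < SMALL_UPS_LIMIT_KVA else 2


if __name__ == "__main__":
    for kva in (3, 10, 12, 80, 500):
        print(f"{kva} kVA -> {recommended_pm_visits_per_year(kva)} visit(s) per year")
```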
A recent Emerson Network Power study of the impact of preventive maintenance on UPS reliability revealed that the mean time between failures (MTBF) for units that received two preventive maintenance service visits a year is 23 times better than a UPS with no preventive maintenance visits. According to the study, reliability continued to steadily increase with additional visits when conducted by factory-trained engineers.
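To get a feel for what a 23-fold MTBF improvement could mean, the sketch below converts MTBF into the probability of at least one failure in a year. It assumes a constant failure rate (an exponential model), which the study does not state, and the baseline MTBF of 100,000 hours is a hypothetical figure chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 8760
MTBF_IMPROVEMENT = 23  # factor reported in the study quoted above


def annual_failure_probability(mtbf_hours):
    """Probability of at least one failure in a year.

    Assumes a constant failure rate: P = 1 - exp(-t / MTBF).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    baseline_mtbf = 100_000  # hypothetical MTBF with no PM visits, in hours
    without_pm = annual_failure_probability(baseline_mtbf)
    with_pm = annual_failure_probability(MTBF_IMPROVEMENT * baseline_mtbf)
    print(f"No PM visits:        {without_pm:.2%} chance of failure per year")
    print(f"Two PM visits a year: {with_pm:.2%} chance of failure per year")
```

Under these assumptions the annual failure probability falls from roughly 8% to under 0.5%. The absolute numbers depend entirely on the assumed baseline.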
Typical tasks performed during a semi-annual PM visit include:
- Checking all breaker connections and associated controls.
Repairing and reporting all connection concerns
- Completing visual inspection of the equipment, including sub-assemblies, wiring harnesses, contacts, cables,
and major components
- Checking air filters for cleanliness
- Checking rectifier and inverter snubber boards for discoloration
- Checking power capacitors for swelling or leaking oil and direct current (DC) capacitor vent caps that have
extruded more than one-eighth inch
- Recording all voltage and current meter readings on the module control cabinet or the system control cabinet
- Measuring and recording harmonic trap filter currents
Typical tasks performed during an annual service call include all the tasks done during a semi-annual visit, plus the following:
- Checking all nuts, bolts, screws, and connectors for tightness and inspecting for heat discoloration
- Testing fuses on the DC capacitor deck (if applicable)
- Performing, with customer approval, an operational test of the system including unit transfer and battery discharge
- Implementing any Engineering Field Change Notices (FCN) as needed
- Measuring and recording all low-voltage power supply levels
- Measuring and recording phase-to-phase input voltage and currents
- Reviewing system performance with customer to address any questions and to schedule repairs or upgrades if
Remember the Batteries
Battery maintenance begins with proper installation of the system. Batteries must be fully charged, battery room conditions must be verified, and baseline readings must be recorded for proper trend analysis throughout the life of the battery. If this information is not properly gathered and documented, future detection of bad batteries could prove to be difficult.
For battery maintenance best practices, refer to the manufacturer’s recommendations, IEEE 1188 for valve-regulated lead-acid (VRLA) batteries, and IEEE 450 for vented lead-acid (VLA, or flooded) batteries. However, best practices do not always equate to common practices. Governed by real-world factors, facility managers are often forced to weigh the cost of performing the recommended IEEE schedule against the criticality of the application. Table 1 represents a typical PM schedule for both VLA (flooded) and VRLA batteries.
High ambient temperature and unusually frequent discharges (or over-cycling) are most commonly responsible for reducing useful life across all types of batteries. Dryout is the most common cause of VRLA battery failure. Battery aging accelerates dramatically as ambient temperature increases. This is true of batteries in service and in storage. Even under ideal conditions, batteries are designed to provide a limited number of discharge cycles during their expected life. While that number may be adequate in some applications, there are instances where a battery can wear out prematurely.
Other factors that can cause premature battery failure include:
- High or low charge voltage
- Overcharging or frequent equalizing charges
- Loose or strained connections
- Dirt & corrosion—parasitic/ground faults
- Manufacturing defects
- Keeping failing cells within a string of good cells
Once a battery is operating properly, it’s important to proactively monitor daily battery performance trends to help detect potential battery failures. A battery monitoring system provides a continuous watch of the battery to assess its true state of health. Instead of waiting for an inevitable failure or replacing batteries prematurely to prevent problems, battery monitors allow organizations to continue to utilize their batteries longer and with confidence by knowing the true condition of all critical battery parameters such as cell voltage, internal resistance, cycle history, overall string voltage, current, and temperature.
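Trending against the baseline readings recorded at installation can be sketched as follows. This is an illustration, not a description of any monitoring product: the record layout and cell names are hypothetical, and the 25% rise limit is a placeholder to be replaced with the battery manufacturer's guidance.

```python
from dataclasses import dataclass


@dataclass
class CellReading:
    """Hypothetical record for one cell; field names are illustrative."""
    cell_id: str
    baseline_milliohm: float   # internal resistance recorded at installation
    current_milliohm: float    # most recent measurement


def resistance_rise_percent(cell):
    """Rise in internal resistance relative to the installation baseline."""
    return (100.0 * (cell.current_milliohm - cell.baseline_milliohm)
            / cell.baseline_milliohm)


def cells_to_investigate(cells, rise_limit_percent=25.0):
    """Return IDs of cells whose resistance rose more than the limit.

    rise_limit_percent is a placeholder threshold, not a value from the article.
    """
    return [c.cell_id for c in cells
            if resistance_rise_percent(c) > rise_limit_percent]


if __name__ == "__main__":
    string = [
        CellReading("cell-01", baseline_milliohm=4.0, current_milliohm=4.2),
        CellReading("cell-02", baseline_milliohm=4.0, current_milliohm=5.4),
    ]
    print(cells_to_investigate(string))
```

The same comparison applies to other monitored parameters such as cell voltage and temperature. It only works if the baseline was captured when the battery was installed.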
While there are many battery services available, the best solution to maximizing battery performance is to utilize an integrated battery monitoring service that combines state-of-the-art battery monitoring technology with proactive maintenance and service response. This type of proactive solution integrates onsite and remote PM activities with predictive analysis to identify problems before they occur.
Author's Bio: Ben Kissell is service solutions manager at Emerson Network Power’s Liebert Services.