How to Manage Server Rack Temperatures and Prevent Downtime
Overheating kills electronics, including servers, network switches, IT peripherals and UPS batteries. Because warm air rises, it makes sense to put the most heat-sensitive items at the bottom of a server rack. Within a server room or data centre environment, the amount of power being drawn is high enough for hot spots to reach critical temperatures, at which point there is a real risk of fire and catastrophic failure.
Hot spots are caused by a combination of inadequate airflow and poor equipment layout, and can be common within server racks as well as in the rooms in which the racks are installed. Temperature should be monitored at the rack level using an environment monitoring system or, at the very least, intelligent PDUs with additional temperature sensors. Temperature spikes can indicate a failing component and should be investigated immediately. But what other actions can data centre managers take to reduce cooling-related downtime?
Server Rack Ventilation and Population
As with most server-related IT devices, network switches have a high mean time between failures (MTBF), and failures are rare. When a network switch does fail, the downstream IT equipment connected to it goes off-line. If the failure also leads to a power surge, this can damage the power supplies and ports of connected devices.
There are generally two types of network switches: End-of-Row (EoR) and Top-of-Rack (ToR). EoR switches are easier to swap out but are more expensive to purchase and install due to the additional network wiring required. Data centre managers tend to use ToR switches because they require less wiring, but if one overheats this can lead to the loss of an entire rack.
It is therefore vital to ensure that racks are properly ventilated and monitored to avoid thermal hot-spots and reduce the potential for a network switch failure. What makes this challenging is that switches tend to be mounted in server racks backwards, to make it easier to access the ports from the maintenance aisle. This puts their front air intake further away from the cool air supplied to the front of the racks, typically via perforated raised access floor tiles.
The best practice solution is therefore to install switches at the top of server racks and orient them towards the front of the rack cabinet. Their orientation then matches that of the other devices within the server rack, which allows the switches to draw cool air in via their ventilation fans and expel the hot air through the rear exhaust channel into the hot aisle.
Most UPS systems have high operating efficiencies; in on-line mode these can be around 95% or greater for a transformer-less system. Rack mount UPS systems are designed for installation within a server rack and their high efficiency means that they should not add significantly to the cooling requirements. The weight of a typical UPS system with its battery set also means that the uninterruptible power supply should be installed at or near the bottom of the server rack so as not to affect the rack’s centre of gravity.
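As a rough worked example of why a high-efficiency UPS adds little to the cooling load, the sketch below estimates the heat a UPS dissipates for a given load. The function name and the 3kW load are illustrative assumptions, and battery charging losses are ignored.

```python
def ups_heat_loss_watts(load_watts, efficiency):
    """Estimate heat dissipated by a UPS supplying load_watts.

    efficiency is a fraction between 0 and 1 (for example 0.95 for 95%).
    Losses are taken as input power minus output power.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be greater than 0 and at most 1")
    input_watts = load_watts / efficiency
    return input_watts - load_watts


# Illustrative figures only: a 3kW load on a UPS running at 95% efficiency.
loss = ups_heat_loss_watts(3000, 0.95)
print(f"Approximate UPS heat loss: {loss:.0f} W")
```

Under these assumptions the UPS contributes roughly 5% of the load as extra heat, which is small next to the heat from the servers it supports.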
In between the network switches and UPS system (if installed) is the space provided for the servers themselves. Best practice also calls for additional temperature monitoring within the server rack at six points (the top, middle and bottom of both the front and rear) in order to identify potential hot-spots.
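The six-point monitoring layout can be sketched as a simple check over one set of readings. This is a minimal illustration, not a vendor API: the position names, the function name and both temperature limits are assumptions and should be replaced with values from your own equipment specifications.

```python
# Six hypothetical sensor positions: top, middle and bottom, front and rear.
POSITIONS = [
    "front-top", "front-middle", "front-bottom",
    "rear-top", "rear-middle", "rear-bottom",
]


def find_hot_spots(readings_c, front_limit_c=27.0, rear_limit_c=40.0):
    """Return the positions whose reading exceeds the limit for that face.

    Both default limits are illustrative assumptions, not values taken
    from this article.
    """
    missing = [p for p in POSITIONS if p not in readings_c]
    if missing:
        raise ValueError(f"missing sensor readings: {missing}")

    hot = []
    for position in POSITIONS:
        # Front sensors measure intake air, rear sensors measure exhaust air.
        limit = front_limit_c if position.startswith("front") else rear_limit_c
        if readings_c[position] > limit:
            hot.append(position)
    return hot


# Example readings in degrees Celsius (made-up values).
readings = {
    "front-top": 29.5, "front-middle": 23.0, "front-bottom": 21.0,
    "rear-top": 38.0, "rear-middle": 36.5, "rear-bottom": 34.0,
}
print(find_hot_spots(readings))
```

A warm reading at the top of the front face, as in the example data, often points to hot exhaust air recirculating over the top of the rack.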
Aisle Containment for High Density Racks
High density racks typically draw 10-30kW rather than the 3-5kW of a typical server rack. If there is a cooling failure within or for a high density rack, there can be as little as three minutes in which to bring additional computer room air conditioners (CRACs) on-line if a meltdown and fire are to be avoided.
Cooling requirements are a function of rack power densities. As the power demand increases, so does the need to remove the heat build-up from the operation of the server CPUs (central processing units). The more power and equipment within a server rack, the greater the potential for thermal hot-spots to occur, and this is especially so within high-density racks. Hot-spots of 30°C and above can be catastrophic and lead to downtime.
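To make the link between power density and cooling concrete, the sketch below converts rack power into a heat load. It assumes that essentially all electrical power drawn by the IT equipment ends up as heat, and uses the standard conversions of 1 W = 3.412 BTU/h and 1 ton of refrigeration = 12,000 BTU/h. The function names are illustrative.

```python
BTU_PER_HOUR_PER_WATT = 3.412   # 1 W is about 3.412 BTU/h
BTU_PER_HOUR_PER_TON = 12000    # 1 ton of refrigeration = 12,000 BTU/h


def heat_load_btu_per_hour(rack_watts):
    """Heat load in BTU/h, assuming all rack power becomes heat."""
    return rack_watts * BTU_PER_HOUR_PER_WATT


def cooling_tons(rack_watts):
    """Equivalent cooling capacity in tons of refrigeration."""
    return heat_load_btu_per_hour(rack_watts) / BTU_PER_HOUR_PER_TON


# Compare a typical rack with a high density rack.
for watts in (5000, 30000):
    print(
        f"{watts / 1000:.0f} kW rack: "
        f"{heat_load_btu_per_hour(watts):,.0f} BTU/h, "
        f"{cooling_tons(watts):.2f} tons"
    )
```

On these figures a 30kW rack needs about six times the cooling capacity of a 5kW rack, all concentrated in the same floor area.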
Active containment of the hot-aisles and cold-aisles is one way to manage cooling for high-density racks. The arrangement uses over-cabinet fans and a containment chamber to facilitate cold air flow into the front of the cabinets and hot air flow from the rear of the server cabinets into return plenums. This helps to prevent hot-spots and heat build-up.
An active containment system monitors air pressures in the chamber and adjusts the fan speeds to prevent hot air from building up around the racks when they are drawing the most power during peak utilisation periods. The arrangement also reduces heat in the maintenance aisle, providing a more comfortable working environment for data centre technicians and engineers, as the hot air is pulled up into the return plenum. An active containment system can also provide extra time to bring additional cooling online if a CRAC unit fails.
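The pressure-driven fan adjustment described above can be sketched as a single step of a simple incremental controller. This is only an illustration of the idea: real containment systems use vendor-specific control logic, and the function name, gain, setpoint and speed limits here are all assumptions.

```python
def next_fan_speed(current_pct, pressure_pa, setpoint_pa=0.0,
                   gain_pct_per_pa=2.0, min_pct=20.0, max_pct=100.0):
    """Return the next fan speed (percent) for one control step.

    pressure_pa is the measured chamber pressure relative to the room.
    A reading above the setpoint suggests hot air is accumulating, so the
    fan speed is raised; a reading below the setpoint lowers it.
    All default values are hypothetical.
    """
    error_pa = pressure_pa - setpoint_pa
    proposed = current_pct + gain_pct_per_pa * error_pa
    # Clamp to the range the fans can actually run at.
    return max(min_pct, min(max_pct, proposed))


# Simulate a few control steps with made-up pressure readings.
speed = 50.0
for reading in (5.0, 3.0, 0.0, -2.0):
    speed = next_fan_speed(speed, reading)
    print(f"pressure {reading:+.1f} Pa -> fan speed {speed:.0f}%")
```

Clamping to a minimum speed keeps some airflow through the chamber even when the racks are lightly loaded.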
The Implications of ASHRAE on PUE
In order to improve power usage effectiveness (PUE), ASHRAE recommends operating servers within higher ambient temperatures. This reduces cooling needs and in turn the electricity demand of the cooling system. Active containment and airflow management can therefore play a pivotal role not only in reducing hot-spot related risks but also in improving energy efficiency. Running at higher ambient temperatures does, however, put a server rack at greater risk of downtime, as critical temperature limits will be reached more quickly if there is a cooling failure.
The Green Grid’s Performance Indicator metric therefore recommends monitoring two further metrics in addition to PUE:
- IT Thermal Conformance to monitor the percentage of IT equipment operating within ASHRAE’s recommended temperature range whilst cooling systems are fully functioning.
- IT Thermal Resilience to monitor the percentage of IT equipment operating within ASHRAE’s recommended temperature range after a cooling system failure.
More info: https://www.datacenterknowledge.com/whitepapers/using-green-grid-performance-indicator-tool-achieve-data-center-reliability-energy
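Both metrics above are percentages of equipment within a temperature range, so they can be sketched with one helper. The default range of 18°C to 27°C is assumed here as the commonly cited ASHRAE recommended envelope; check the current guidance for your equipment class. The function name and the inlet readings are made up for illustration.

```python
def percent_within_range(inlet_temps_c, low_c=18.0, high_c=27.0):
    """Percentage of inlet temperature readings within [low_c, high_c]."""
    if not inlet_temps_c:
        raise ValueError("at least one reading is required")
    within = sum(1 for t in inlet_temps_c if low_c <= t <= high_c)
    return 100.0 * within / len(inlet_temps_c)


# One inlet reading per item of IT equipment (hypothetical values, in Celsius).
normal_operation = [21.0, 23.5, 24.0, 26.5, 28.0]
cooling_failure = [26.0, 29.5, 31.0, 27.0, 33.0]

conformance = percent_within_range(normal_operation)
resilience = percent_within_range(cooling_failure)
print(f"IT Thermal Conformance: {conformance:.0f}%")
print(f"IT Thermal Resilience: {resilience:.0f}%")
```

Tracking the two figures side by side shows how much thermal headroom is being traded away as ambient temperatures are raised to improve PUE.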
Whilst ASHRAE recommends pushing up server room temperatures to improve energy usage, this can present down-the-line issues in terms of the working life of critical components within a server room or data centre environment.
Summary
There are several aspects to consider in relation to server racks. It is important to size a server rack to ensure it can be populated with the required equipment and provide enough space for cooling and expansion. The larger the number of items within the rack, the greater the need for cooling and the potential for hot-spots. It is therefore vital to monitor temperatures and manage airflow to prevent hot-spots and reduce the potential for heat-induced downtime.