Which Data Centre Systems Require Regular Maintenance?
The infrastructure systems within a server room or data centre are classed as either critical, essential or non-essential. The three most common critical infrastructure systems are power, cooling and fire suppression, followed by environmental monitoring and security systems. Regular inspections and preventative maintenance are required to ensure that these infrastructure systems are reliable and available 24/7.
What is Preventative Maintenance?
Any maintenance contract or service plan is a proactive approach to maintaining the overall operability of a system. Preventative maintenance visits (PMVs) are carried out to inspect, test and verify operating parameters, check history and alarm logs, check and, if necessary, tighten terminals and connections, update firmware, check calibrations, and replace consumable items in line with system manufacturer recommendations. The work should only be carried out by manufacturer-certified engineers to ensure access to the latest software tools, documentation, and manufacturer-approved consumables and spares.
Critical Infrastructure Systems
Critical infrastructure systems ensure that a server room or data centre is available, regardless of external factors such as a power outage, whilst providing a secure, temperature- and humidity-controlled environment. Critical systems therefore include:
- Cooling: HVAC and Air Conditioners, CRACs, CRAHs and Chillers
- Power: LV switchboards, uninterruptible power supplies, standby generators, PDUs and energy storage systems
- Fire Suppression: room level and rackmount where installed
- Environment Monitoring: which can consist of one or more environment monitoring base units with sensors for temperature, humidity, water leakage and other environmental factors, including smoke detection
- Security systems: which can include both room and rack level access control, and CCTV systems
Critical power and cooling systems may be designed and installed in configurations to achieve a specific Tier-rating from the Uptime Institute. This provides a measure of the availability and maintainability of the overall data centre design.
Critical Infrastructure Assets and Maintenance Logs
Some sites can provide us with a complete asset list and maintenance log. This makes it easy for us to carry out a site survey and provide a comprehensive quotation for any corrective works to be undertaken and a suitable maintenance contract.
Other sites may not have such detailed logs and records. In this instance we assign a project manager to carry out a survey and audit. The data collected allows us to create a comprehensive overview, with each asset listed and its last preventative maintenance visit and history logged. An asset list should include:
- Original equipment manufacturer and/or supplier
- Equipment part, SKU and serial number(s) and firmware versions
- Outstanding warranty and terms & conditions
- Installation date and estimated working life
- Service records including last PMV date and the next due date
- Consumable replacement requirements including fans, filters and batteries
- Annual service requirements
- General and specialist testing regimes and certifications required
- Specialist service software and dongle requirements
- Maintenance engineer qualifications required
Once the asset list and maintenance logs have been created, a copy is shared with the client for comment or amendment and is then regularly updated during the life of the maintenance contract. It is important to ensure that asset and consumable disposals are recorded, as are new asset additions.
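As an illustration of how an asset list with PMV dates might be held and queried, the following is a minimal Python sketch. The `Asset` class, its field names and the 365-day default service interval are assumptions made for this example only; a real asset register would carry all of the items listed above and follow each manufacturer's own service intervals.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Asset:
    """One entry in a hypothetical asset list (field names are illustrative)."""
    name: str
    manufacturer: str
    serial_number: str
    firmware_version: str
    installation_date: date
    last_pmv: Optional[date] = None      # date of the last preventative maintenance visit
    pmv_interval_days: int = 365         # assumed annual service interval
    consumables: List[str] = field(default_factory=list)

    def next_pmv_due(self) -> date:
        """Due date is one interval after the last PMV, or after installation if never serviced."""
        baseline = self.last_pmv or self.installation_date
        return baseline + timedelta(days=self.pmv_interval_days)

    def is_overdue(self, today: date) -> bool:
        """True when the next PMV due date has already passed."""
        return today > self.next_pmv_due()


def overdue_assets(assets: List[Asset], today: date) -> List[Asset]:
    """Return the assets whose PMV is overdue, earliest due date first."""
    overdue = [a for a in assets if a.is_overdue(today)]
    return sorted(overdue, key=lambda a: a.next_pmv_due())
```

Calling `overdue_assets(asset_list, date.today())` before a site visit would give a simple worklist of systems that have missed their planned service date.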
Preventing Data Centre Downtime
Critical infrastructure systems require regular maintenance, inspection, and monitoring. Some of this can be achieved using environmental monitoring and data centre infrastructure management (DCIM) packages.
Environmental monitoring systems are installed to monitor temperature and humidity levels and to detect water leakage and smoke. Sudden changes to environmental factors can indicate, for example, a loss of cooling. With the appropriate sensors in place, an environmental monitoring system can monitor multiple points within a server room or data centre, with alarm conditions triggering alerts to key personnel via email, SMS text messages and phone calls.
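The alarm logic described here can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular monitoring product works: the `Reading` class, the `check_reading` function and all threshold values are assumptions chosen for the example, and real limits should come from site policy and equipment specifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    """A single sample from one hypothetical sensor location."""
    sensor: str
    temperature_c: float
    humidity_pct: float


# Illustrative limits only (assumptions for this sketch).
TEMP_MAX_C = 27.0
HUMIDITY_RANGE_PCT = (20.0, 80.0)
TEMP_RISE_LIMIT_C = 3.0  # rise between consecutive readings treated as "sudden"


def check_reading(current: Reading, previous: Optional[Reading] = None) -> List[str]:
    """Return a list of alarm messages for this reading (empty if all is normal)."""
    alarms = []
    if current.temperature_c > TEMP_MAX_C:
        alarms.append(f"{current.sensor}: temperature above limit")
    low, high = HUMIDITY_RANGE_PCT
    if not low <= current.humidity_pct <= high:
        alarms.append(f"{current.sensor}: humidity out of range")
    # A sharp rise between samples can point to a loss of cooling
    # even while the absolute temperature is still within limits.
    if previous is not None and (
        current.temperature_c - previous.temperature_c >= TEMP_RISE_LIMIT_C
    ):
        alarms.append(f"{current.sensor}: sudden temperature rise, possible loss of cooling")
    return alarms
```

In a real installation, each message returned would be passed to whatever notification channel the site uses, such as email or SMS.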
DCIM packages can take environmental monitoring to the next level, providing more comprehensive monitoring of the overall data centre facility and its assets, along with health polling and modelling. In some instances, DCIM packages can use AI algorithms to analyse the data collected and guide corrective actions in terms of managing air inlet temperatures, cooling system settings and arrangements.
For more information on Data Centre downtime:
https://www.sunbirddcim.com/blog/understanding-cost-data-center-downtime
In addition to these types of server room or data centre monitoring, pre-planned physical maintenance services must take place to prevent downtime. Visual inspections can identify issues that cannot be easily uncovered through data analysis and consumable items require replacement in line with system manufacturer’s recommendations.
The Uptime Institute regularly surveys the data centre market and one of their reports covers data centre outages. The latest report states that issues with UPS systems were the most common cause of power-related outages (53%), citing fan failures, capacitor wear and tear, battery ageing and inverter stack failures as the primary causes. These are all issues that can be prevented through regular, planned maintenance.
More information:
https://uptimeinstitute.com/annual-outage-analysis-2021
The Benefits of Planned Preventative Maintenance
Preventative maintenance visits (PMVs), also referred to as Planned Preventative Maintenance (PPM) visits, help to ensure system availability and can improve overall system design and resilience.
During a maintenance visit, small issues can be observed, and corrective actions taken to avoid them becoming critical.
A classic example is UPS batteries. Lead acid battery performance deteriorates with age, and one observable sign is a swelling of the battery case. In this instance, a battery set can be replaced before it becomes unable to provide sufficient power to the connected UPS system during a power outage. Generators require regular testing and inspection and have consumable items (such as fuel filters) that require regular replacement, as do air conditioning systems. LV switchboards may have built-in power factor correction, which will require capacitor inspection and thermal imaging to identify ‘hot-spots’ and potential failures.
The server racks themselves may require some adjustment to improve airflow. Blanking panels may be required or side panels refitted. It is also important to ensure environmental sensors are not only placed securely but regularly checked for calibration.
Security risks can also be identified during a maintenance visit. Checks can include confirming that only approved personnel have access to the server room and/or specific server racks, and that CCTV cameras are operating as intended and their footage is being recorded for incident analysis.
For some systems, including fire suppression, regular maintenance and inspection certificates may be required for insurance purposes and/or to meet tenancy arrangements.
Whether the critical system is an uninterruptible power supply or an air conditioner, technology and energy efficiency improvements occur with each new generation. A maintenance visit can help to identify systems due for replacement and how best to plan this.
Preventative maintenance checks also help to ensure that equipment is operating at its optimal settings, which allows organisations to get the most out of a data centre. Well-maintained, up-to-date technology helps a data centre to operate quickly, smoothly and effectively.
An important part of a maintenance inspection is to review previous issues and ensure that the corrective actions taken have been effective and were carried out as planned. Where remedial actions are not effective, operational costs can rise and system reliability can be affected.
Summary
Server Room Environments has a nationwide team of data centre maintenance and service engineers, providing 24-hour, 7-day-a-week support, on-site services and rapid callout facilities. We provide bespoke maintenance plans to suit the service levels required and budgets available, working with existing clients and responding as fast as we can to new enquiries. Our extensive supply chain also means that we can draft in additional services as ‘bolt-ons’ to ensure we provide one of the most comprehensive data centre maintenance service packages available in the UK.