Maintaining uptime in the data center is no game of checkers
Article by Intel Data Center Management Solutions general manager Jeff Klaus.
While popular notions of artificial intelligence (AI) may have once conjured up images of automation such as 2001’s HAL 9000, The Terminator’s Skynet, and Ava of Ex Machina, in reality, AI and its subset, machine learning, had more benign origins.
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed on where to look for them.
Arthur Samuel, one of the pioneers of machine learning, taught a computer program to play checkers, a skill he could not have programmed explicitly. In 1962, Samuel’s machine learning program not only bested him at checkers, but ultimately went on to defeat the Connecticut state champion.
AI enters the data center
Today, Gartner estimates that 37% of enterprise organizations are already implementing AI in some form. Associated technologies such as machine learning and deep learning promise to save organizations billions of dollars over the next few decades, as financial services, healthcare, oil and gas, and retail companies build data science applications, recommendation engines, large-scale analytics, and other new applications driven by high-performance computing (HPC) environments.
AI and machine learning technologies can also be leveraged to improve the efficiency of IT operations in the data center. In the operation of enterprise and cloud service provider (CSP) data centers, IT equipment is commonly managed and operated in a passive manner.
That is, IT operators can do little to nothing before server, network, and storage equipment failures happen. Afterward, they invariably ask their equipment vendors to repair devices, or they take reactive measures such as bringing up a standby environment or redeploying the business workload.
This method can be adequate for managing small-scale server clusters. However, for today’s large enterprises and CSPs, which can have thousands of servers or more, IT teams would be under tremendous operational and maintenance pressure if they continued to manage their equipment this way.
For the enterprise and CSP, which require high reliability in the operation of the data center, and particularly given the high availability and stability demands placed on public cloud service providers, the inherent risks involved could prove damaging to both the balance sheet and organizations’ brand reputation.
A 2019 global survey of enterprise organizations by Statista found that for one out of four companies worldwide, the average cost of server downtime was between $301,000 and $400,000 per hour. With these stakes in play, maintaining uptime in the data center is no game of checkers.
Today, many types of IT equipment provide logs to help diagnose and analyze problems. Management software can collect this data through out-of-band channels or operating system agents, use machine learning algorithms to learn error patterns in the logs, and build corresponding models to detect and identify abnormal behavior.
Furthermore, analyzing equipment operating status makes fine-grained predictions of equipment health possible. The operator or the software system can then act on the results or trends before a failure happens, for example by adjusting the load on a server or migrating it elsewhere.
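To make the idea of spotting abnormal patterns in equipment logs concrete, here is a minimal Python sketch. It is a simple statistical stand-in, not a trained machine learning model and not any vendor's method: it flags an hourly error count that rises far above a rolling baseline. The function name, the window size, the threshold, and the sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds the rolling baseline.

    The baseline is the mean of the previous `window` samples, and the
    limit is mean + threshold * stdev (hypothetical defaults).
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline (stdev of 0)
        # does not flag every tiny fluctuation.
        sigma = max(stdev(baseline), 1.0)
        if counts[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged


# Made-up hourly correctable-error counts parsed from a server log.
hourly_errors = [2, 3, 2, 4, 3, 2, 3, 40, 3, 2]
print(flag_anomalies(hourly_errors))  # prints [7]
```

A production system would more likely learn from many signals at once, but the shape is the same: establish what normal looks like, then surface departures from it early enough to act.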
Machine learning in service of uptime
Memory failure is one of the top three types of hardware failure in data centers today. Using machine learning to analyze real-time memory health data makes it possible to predict such failures ahead of time, which ultimately translates to a better experience for end users of the application.
Intel Memory Failure Prediction (MFP) is an AI-based technology that improves memory reliability through predictions based on analysis of micro-level memory failure logs. It’s an ideal solution for enterprise businesses and CSPs that rely heavily on server hardware reliability, availability and serviceability. Intel MFP helps to significantly reduce memory failure events by analyzing data and then predicting catastrophic events before they happen.
Intel MFP uses machine learning to analyze server memory errors down to the Dual Inline Memory Module (DIMM), bank, column, row, and cell levels to generate a memory health score, which can be used to predict potential failures. By analyzing memory errors and predicting potential memory failures before they happen, Intel MFP can help improve decisions about which DIMMs to discard and which to purchase.
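As a rough illustration of how error locations might roll up into a per-DIMM health score, consider the toy Python heuristic below. It is not Intel MFP's model, which the article does not detail. The `CorrectableError` record, the `health_scores` function, the penalty weights, and the sample data are all hypothetical, and the assumption that errors spread across many cells and banks are worse than repeats at one cell is ours.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """One logged memory error location (hypothetical record layout)."""
    dimm: str
    bank: int
    row: int
    column: int


def health_scores(errors):
    """Map each DIMM to a 0-100 score; lower means less healthy."""
    by_dimm = defaultdict(list)
    for err in errors:
        by_dimm[err.dimm].append(err)

    scores = {}
    for dimm, errs in by_dimm.items():
        cells = {(e.bank, e.row, e.column) for e in errs}
        banks = {e.bank for e in errs}
        # Illustrative weights only: each error costs 1 point, each
        # distinct faulty cell 5 points, each affected bank 10 points.
        penalty = len(errs) + 5 * len(cells) + 10 * len(banks)
        scores[dimm] = max(0, 100 - penalty)
    return scores


# Made-up sample: DIMM A repeats at one cell, DIMM B spreads out.
sample = (
    [CorrectableError("A", 0, 1, 1)] * 3
    + [CorrectableError("B", 0, 1, 1), CorrectableError("B", 0, 2, 7),
       CorrectableError("B", 1, 5, 3), CorrectableError("B", 1, 9, 4)]
)
print(health_scores(sample))  # prints {'A': 82, 'B': 56}
```

A learned model would replace the fixed weights with parameters fitted to historical failures, but the input (error locations at DIMM, bank, row, and column granularity) and the output (a score per module) would look much the same.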
Additionally, Intel MFP allows data center staff to migrate workloads before catastrophic memory failures happen, use page offlining policies to isolate unreliable memory cells or pages, or replace failing DIMMs before they reach a terminal stage, thus reducing downtime by responding appropriately before server failure occurs.
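How a memory health score might drive these responses can be sketched as a small policy function in Python. The thresholds, the ordering of actions by severity, and the names `Action` and `choose_action` are assumptions made for illustration; an operator would tune the policy to its own environment and tooling.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    OFFLINE_PAGES = "offline affected pages"
    MIGRATE_WORKLOADS = "migrate workloads"
    REPLACE_DIMM = "replace DIMM"


def choose_action(score, offline_below=80, migrate_below=60, replace_below=40):
    """Pick a response for a 0-100 health score (hypothetical cutoffs).

    Lower scores trigger more disruptive responses; the severity
    ordering here is an assumption, not a documented policy.
    """
    if score < replace_below:
        return Action.REPLACE_DIMM
    if score < migrate_below:
        return Action.MIGRATE_WORKLOADS
    if score < offline_below:
        return Action.OFFLINE_PAGES
    return Action.NONE


for score in (95, 72, 55, 20):
    print(score, "->", choose_action(score).value)
```

Keeping the policy separate from the prediction means the thresholds can change, for example ahead of a peak trading period, without retraining or touching the model that produces the score.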
Recently, a Beijing-based company integrated Intel MFP into its existing data center management solution to monitor the health of the memory modules in its servers. The company’s online platform and applications connect consumers with local businesses for everything from food delivery and hotel bookings to health and fitness products and services.
The initial test deployment of Intel MFP indicated that if the company deployed the solution across its full server network, server crashes caused by hardware failures could be reduced by up to 40 percent, delivering a better experience for hundreds of millions of its customers and local vendors.