IT Brief Australia - Technology news for CIOs & IT decision-makers
New MLCommons results deliver competitive AI gains for Intel
Fri, 30th Jun 2023

MLCommons has published results of its industry AI performance benchmark, MLPerf Training 3.0, in which both the Habana Gaudi2 deep learning accelerator and the 4th Gen Intel Xeon Scalable processor delivered impressive training results.

“The latest MLPerf results published by MLCommons validates the TCO value Intel Xeon processors and Intel Gaudi deep learning accelerators provide to customers in the area of AI. Xeon’s built-in accelerators make it an ideal solution to run volume AI workloads on general-purpose processors, while Gaudi delivers competitive performance for large language models and generative AI,” says Sandra Rivera, Intel executive vice president and general manager of the data centre and AI group. 

“Intel’s scalable systems with optimised, easy-to-program open software lowers the barrier for customers and partners to deploy a broad array of AI-based solutions in the data centre from the cloud to the intelligent edge.”

The current industry narrative is that generative AI and large language models (LLMs) can run only on Nvidia GPUs. New data shows that Intel’s portfolio of AI solutions provides competitive and compelling options for customers looking to break free from closed ecosystems that limit efficiency and scale.

The latest MLPerf Training 3.0 results underscore the performance of Intel’s products on an array of deep learning models. The maturity of Gaudi2-based software and systems for training was demonstrated at scale on the large language model, GPT-3. Gaudi2 is one of only two semiconductor solutions to submit performance results to the benchmark for LLM training of GPT-3. 

Gaudi2 also provides substantial cost advantages to customers, both in server and system costs. The accelerator's MLPerf-validated performance on GPT-3, computer vision and natural language models, together with upcoming software advances, makes Gaudi2 a compelling price/performance alternative to Nvidia's H100.

On the CPU front, the deep learning training performance of 4th Gen Xeon processors with Intel AI engines demonstrated that customers could use Xeon-based servers to build a single universal AI system for data pre-processing, model training and deployment, delivering the right combination of AI performance, efficiency, accuracy and scalability.

“Training generative AI and large language models requires clusters of servers to meet massive compute requirements at scale. These MLPerf results provide tangible validation of Habana Gaudi2’s outstanding performance and efficient scalability on the most demanding model tested, the 175 billion parameter GPT-3,” adds Rivera.

Highlights of the Gaudi2 results are as follows.

Gaudi2 delivered impressive time-to-train on GPT-3: 311 minutes on 384 accelerators.

Near-linear 95% scaling from 256 to 384 accelerators on the GPT-3 model.

Excellent training results on computer vision models (ResNet-50 and Unet3D, each on 8 accelerators) and natural language processing models (BERT on 8 and 64 accelerators).

Performance increases of 10% and 4% for the BERT and ResNet models, respectively, compared to the November submission, evidence of growing Gaudi2 software maturity.

Gaudi2 results were submitted "out of the box," meaning customers can achieve comparable performance when implementing Gaudi2 on-premises or in the cloud.

Software support for the Gaudi platform continues to mature and keep pace with the growing number of generative AI models and LLMs in popular demand.

Gaudi2's GPT-3 submission was based on PyTorch and employed the popular DeepSpeed optimisation library (part of Microsoft AI at scale) rather than custom software. DeepSpeed enables support of 3D parallelism (Data, Tensor, Pipeline) concurrently, further optimising scaling performance efficiency on LLMs.

Gaudi2 results on the 3.0 benchmark were submitted in the BF16 data type. A significant leap in Gaudi2 performance is expected when software support for FP8 and new features are released in 2023’s third quarter.
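
To put the "near-linear 95% scaling" figure in context, the short Python sketch below works through the arithmetic. It assumes the common definition of scaling efficiency (achieved speedup divided by the ideal, linear speedup); the article does not say how the figure was calculated, and it does not give the 256-accelerator time, so that value is derived here from the reported 311 minutes and 95% rather than taken from the published results. The function names are illustrative only.

```python
def scaling_efficiency(t_small, n_small, t_large, n_large):
    """Fraction of ideal (linear) speedup achieved when growing a cluster."""
    speedup = t_small / t_large      # how much faster the larger cluster is
    ideal = n_large / n_small        # speedup if scaling were perfectly linear
    return speedup / ideal


def implied_baseline_minutes(t_large, n_small, n_large, efficiency):
    """Time on the smaller cluster implied by a reported efficiency."""
    return t_large * (n_large / n_small) * efficiency


t_384 = 311.0  # minutes on 384 accelerators (figure quoted in the article)

# Assumption: derived from the 95% figure, not a published measurement.
t_256 = implied_baseline_minutes(t_384, 256, 384, 0.95)

print(f"Implied time on 256 accelerators: {t_256:.1f} min")
print(f"Scaling efficiency: {scaling_efficiency(t_256, 256, t_384, 384):.0%}")
```

Under these assumptions, adding 50% more accelerators cuts training time by roughly 30%, against the 33% that perfectly linear scaling would give.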

“As the lone CPU submission among numerous alternative solutions, MLPerf results prove that Intel Xeon processors provide enterprises with out-of-the-box capabilities to deploy AI on general-purpose systems and avoid the cost and complexity of introducing dedicated AI systems,” notes Rivera.

“For a few customers who intermittently train large models from scratch, they can use general-purpose CPUs, and often on the Intel-based servers they are already deploying to run their businesses. However, most will use pre-trained models and fine-tune them with their own smaller curated data sets. Intel previously released results demonstrating that this fine-tuning can be accomplished in only minutes using Intel AI software and standard industry open source software.”

Rivera also shared highlights of the Xeon MLPerf results.

In the closed division, 4th Gen Xeons could train BERT and ResNet-50 models in less than 50 mins. (47.93 mins.) and less than 90 mins. (88.17 mins.), respectively.

With BERT in the open division, the results show that Xeon could train the model in about 30 minutes (31.06 mins.) when scaling to 16 nodes.

For the larger RetinaNet model, Xeon achieved a time of 232 mins. on 16 nodes, so customers can use off-peak Xeon cycles to train their models throughout the morning, over lunch or overnight.

4th Gen Xeon with Intel Advanced Matrix Extensions (Intel® AMX) delivers significant out-of-box performance improvements that span multiple frameworks, end-to-end data science tools and a broad ecosystem of smart solutions.

“MLPerf, generally regarded as the most reputable benchmark for AI performance, enables fair and repeatable performance comparison across solutions. Additionally, Intel has surpassed the 100-submission milestone and remains the only vendor to submit public CPU results with industry-standard deep-learning ecosystem software,” says Rivera.

“These results also highlight the excellent scaling efficiency possible using cost-effective and readily available Intel Ethernet 800 Series network adapters that utilise the open source Intel Ethernet Fabric Suite Software that’s based on Intel oneAPI.”