Google has made its Cloud TPUs available in beta on Google Cloud Platform (GCP) to help machine learning experts train and run their ML models faster.
Google defines its Cloud TPUs (tensor processing units) as hardware accelerators that are optimised to speed up and scale up specific ML workloads programmed with TensorFlow.
Each Cloud TPU is built with four custom ASICs, and packs up to 180 teraflops of floating-point performance and 64 GB of high-bandwidth memory onto a single board.
The boards can be used alone or connected via an ultra-fast, dedicated network to form multi-petaflop ML supercomputers called “TPU pods”, Google explained in a blog post yesterday.
Google stated that it will offer these larger supercomputers on GCP later in the year.
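To put the quoted board figures in perspective, the short Python sketch below works through the arithmetic. It assumes the 180 teraflops and 64 GB are split evenly across the four ASICs and that boards scale perfectly when combined; neither assumption comes from Google's announcement, and the function names are purely illustrative.

```python
import math

# Figures quoted in the article for a single Cloud TPU board.
BOARD_TERAFLOPS = 180
BOARD_MEMORY_GB = 64
ASICS_PER_BOARD = 4


def per_asic_share(board_total, asics=ASICS_PER_BOARD):
    """Even split of a board-level figure across its ASICs (an assumption)."""
    return board_total / asics


def boards_for_petaflops(target_petaflops, board_teraflops=BOARD_TERAFLOPS):
    """Fewest boards whose combined peak reaches the target.

    Assumes perfect scaling, i.e. no interconnect or software overhead.
    """
    return math.ceil(target_petaflops * 1000 / board_teraflops)


print(per_asic_share(BOARD_TERAFLOPS))  # 45.0 teraflops per ASIC
print(per_asic_share(BOARD_MEMORY_GB))  # 16.0 GB per ASIC
print(boards_for_petaflops(1))          # 6 boards to pass one petaflop
```

On these assumptions, six boards would be enough to cross the one-petaflop mark at peak, which gives a rough sense of the scale implied by a "multi-petaflop" pod.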
“We designed Cloud TPUs to deliver differentiated performance per dollar for targeted TensorFlow workloads and to enable ML engineers and researchers to iterate more quickly,” Google said on its blog. The company elaborated on this with three examples:
- Instead of waiting for a job to schedule on a shared compute cluster, you can have interactive, exclusive access to a network-attached Cloud TPU via a Google Compute Engine VM that you control and can customise
- Rather than waiting days or weeks to train a business-critical ML model, you can train several variants of the same model overnight on a fleet of Cloud TPUs and deploy the most accurate trained model in production the next day
- Using a single Cloud TPU and following Google’s tutorial, you can train ResNet-50 to the expected accuracy on the ImageNet benchmark challenge in less than a day, all for well under $200
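The cost claims in these examples come down to simple multiplication of device hours by an hourly rate. The sketch below shows one way to budget a training run or a batch of model variants. The hourly rate and run length used here are placeholders, not Google's published pricing, and the function names are hypothetical.

```python
def training_cost(hours_per_run, hourly_rate_usd, num_runs=1):
    """Total cost of num_runs training runs, each occupying one device."""
    return hours_per_run * hourly_rate_usd * num_runs


def max_hourly_rate(budget_usd, hours):
    """Highest hourly rate at which a run of the given length fits the budget."""
    return budget_usd / hours


# Placeholder values for illustration only; not actual Cloud TPU pricing.
HYPOTHETICAL_RATE_USD = 5.0
HYPOTHETICAL_RUN_HOURS = 20

single_run = training_cost(HYPOTHETICAL_RUN_HOURS, HYPOTHETICAL_RATE_USD)
four_variants = training_cost(
    HYPOTHETICAL_RUN_HOURS, HYPOTHETICAL_RATE_USD, num_runs=4
)

print(single_run)     # 100.0
print(four_variants)  # 400.0
print(round(max_hourly_rate(200, 24), 2))  # 8.33
```

The last line shows that a full 24-hour run stays within a $200 budget only if the rate is about $8.33 per hour or less; shorter runs leave correspondingly more headroom.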
ML model training
Google’s Cloud TPUs can be programmed with high-level TensorFlow APIs, and the company has open-sourced a set of reference high-performance Cloud TPU model implementations.
Google plans to open-source additional model implementations over time.
“Adventurous ML experts may be able to optimise other TensorFlow models for Cloud TPUs on their own using the documentation and tools we provide,” Google added.
Google will introduce TPU pods later this year, which it expects to further improve time-to-accuracy for models trained on Cloud TPUs.
“Both ResNet-50 and Transformer training times drop from the better part of a day to under 30 minutes on a full TPU pod, no code changes required,” the blog detailed.
Alfred Spector, chief technology officer of Two Sigma and a former senior Google engineer, commented, “We made a decision to focus our deep learning research on the cloud for many reasons, but mostly to gain access to the latest machine learning infrastructure.”
“Google Cloud TPUs are an example of innovative, rapidly evolving technology to support deep learning, and we found that moving TensorFlow workloads to TPUs has boosted our productivity by greatly reducing both the complexity of programming new models and the time required to train them.”
Spector concluded, “Using Cloud TPUs instead of clusters of other accelerators has allowed us to focus on building our models without being distracted by the need to manage the complexity of cluster communication patterns.”