Last March, Google’s computers roundly beat the world-class Go champion Lee Sedol, marking a milestone in artificial intelligence. The winning computer program, created by researchers at Google DeepMind in London, used an artificial neural network that took advantage of what’s known as deep learning, a strategy by which neural networks involving many layers of processing are configured in an automated fashion to solve the problem at hand.
Unknown to the public at the time was that Google had an ace up its sleeve. You see, the computers Google used to defeat Sedol contained special-purpose hardware—a computer card Google calls its Tensor Processing Unit.
Norm Jouppi, a hardware engineer at Google, announced the existence of the Tensor Processing Unit two months after the Go match, explaining in a blog post that Google had been outfitting its data centers with these new accelerator cards for more than a year. Google has not shared exactly what is on these boards, but it’s clear that they represent an increasingly popular strategy to speed up deep-learning calculations: using an application-specific integrated circuit, or ASIC.
Another tactic being pursued (primarily by Microsoft) is to use field-programmable gate arrays (FPGAs), which provide the benefit of being reconfigurable if the computing requirements change. The more common approach, though, has been to use graphics processing units, or GPUs, which can perform many mathematical operations in parallel. The foremost proponent of this approach is GPU maker Nvidia.
Indeed, advances in GPUs kick-started artificial neural networks back in 2009, when researchers at Stanford showed that such hardware made it possible to train deep neural networks in reasonable amounts of time.
“Everybody is doing deep learning today,” says William Dally, who leads the Concurrent VLSI Architecture group at Stanford and is also chief scientist for Nvidia. And for that, he says, perhaps not surprisingly given his position, “GPUs are close to being as good as you can get.”
Dally explains that there are three separate realms to consider. The first is what he calls “training in the data center.” He’s referring to the first step for any deep-learning system: adjusting perhaps many millions of connections between neurons so that the network can carry out its assigned task.
In building hardware for that, a company called Nervana Systems, which was recently acquired by Intel, has been leading the charge. According to Scott Leishman, a computer scientist at Nervana, the Nervana Engine, an ASIC deep-learning accelerator, will go into production in early to mid-2017. Leishman notes that another computationally intensive task, bitcoin mining, went from being run on CPUs to GPUs to FPGAs and, finally, to ASICs because of the gains in power efficiency from such customization. “I see the same thing happening for deep learning,” he says.
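To make the training step concrete, here is a minimal sketch in Python of what "adjusting connections" means, shrunk from millions of connections to a single artificial neuron with two inputs. It is an illustration only, not the method used by Google, Nvidia, or Nervana; the function names, the learning rate, and the toy task (learning the logical OR function) are all assumptions chosen for brevity.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Forward pass: weighted sum of the inputs, then the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=2000, learning_rate=0.5, seed=0):
    """Fit one neuron by gradient descent (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Gradient of the cross-entropy loss with respect to the
            # neuron's weighted sum is simply (output - target).
            error = predict(weights, bias, inputs) - target
            # Nudge each connection weight against its gradient.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Toy task (an assumption for illustration): learn logical OR.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(OR_SAMPLES)
for inputs, target in OR_SAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

A real deep network repeats this kind of update across many layers and many millions of weights, which is why hardware that can do the arithmetic in parallel matters so much for training.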
A second and quite distinct job for deep-learning hardware, explains Dally, is “inference at the data center.” The word inference here refers to the ongoing operation of cloud-based artificial neural networks that have previously been trained to carry out some job. Every day, Google’s neural networks are making an astronomical number of such inference calculations to categorize images, translate between languages, and recognize spoken words, for example. Although it’s hard to say for sure, Google’s Tensor Processing Unit is presumably tailored for performing such computations.
Training and inference often place very different demands on hardware. Typically for training, the computer must be able to calculate with relatively high precision, often using 32-bit floating-point operations. For inference, precision can be sacrificed in favor of greater speed or less power consumption. “This is an active area of research,” says Leishman. “How low can you go?”
Although Dally declines to divulge Nvidia’s specific plans, he points out that the company’s GPUs have been evolving. Nvidia’s earlier Maxwell architecture could perform double- (64-bit) and single- (32-bit) precision operations, whereas its current Pascal architecture adds the capability to do 16-bit operations at twice the throughput and efficiency of its single-precision calculations. So it’s easy to imagine that Nvidia will eventually be releasing GPUs able to perform 8-bit operations, which could be ideal for inference calculations done in the cloud, where power efficiency is critical to keeping costs down.
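The precision trade-off can be illustrated with a short Python sketch. It rounds the weights and inputs of one neuron to 8-bit integers, computes the weighted sum in integer arithmetic, and compares the result with the full-precision answer. This is one simple quantization scheme (a single shared scale factor per vector), offered as an assumption for illustration; it is not a description of how Nvidia's or Google's hardware works. Note also that plain Python floats are 64-bit, so they stand in here for the higher-precision values.

```python
import random


def quantize(values, bits=8):
    """Map floats to signed integers using one shared scale factor.

    Hypothetical helper: returns (integers, scale) such that
    integer * scale approximates the original value.
    """
    max_int = 2 ** (bits - 1) - 1            # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / max_int if largest > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dot(a, b):
    """Weighted sum, the core operation of neural-network inference."""
    return sum(x * y for x, y in zip(a, b))


# Made-up weights and inputs for one neuron with 256 connections.
rng = random.Random(42)
weights = [rng.uniform(-1, 1) for _ in range(256)]
inputs = [rng.uniform(-1, 1) for _ in range(256)]

# Full-precision reference result.
exact = dot(weights, inputs)

# Low-precision version: integer multiply-accumulate, rescaled once.
q_weights, w_scale = quantize(weights)
q_inputs, x_scale = quantize(inputs)
approx = dot(q_weights, q_inputs) * w_scale * x_scale

print("full precision:", exact)
print("8-bit estimate:", approx)
print("absolute error:", abs(exact - approx))
```

The 8-bit estimate generally lands close to the full-precision result while each stored value takes a quarter of the space of a 32-bit number, which is the kind of bargain that makes low-precision arithmetic attractive for inference.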
Dally adds that “the final leg of the tripod for deep learning is inference in embedded devices,” such as smartphones, cameras, and tablets. For those applications, the key will be low-power ASICs. Over the coming year, deep-learning software will increasingly find its way into applications for smartphones, where it is already used, for example, to detect malware or translate text in images.
And the drone manufacturer DJI is already using something akin to a deep-learning ASIC in its Phantom 4 drone, which uses a special visual-processing chip made by California-based Movidius to recognize obstructions. (Movidius is yet another neural-network company recently acquired by Intel.) Qualcomm, meanwhile, built special circuitry into its Snapdragon 820 processors to help carry out deep-learning calculations.
Although there is plenty of incentive these days to design hardware to accelerate the operation of deep neural networks, there’s also a huge risk: If the state of the art shifts far enough, chips designed to run yesterday’s neural nets will be outdated by the time they are manufactured. “The algorithms are changing at an enormous rate,” says Dally. “Everybody who is building these things is trying to cover their bets.”
This article appears in the January 2017 print issue as “Deeper and Cheaper Machine Learning.”