Tricking Google’s computer vision AI into seeing a pair of human skiers as a dog may seem mostly harmless. But the possibilities become more unnerving when considering how hackers could trick a self-driving car’s AI into seeing a plastic bag instead of a child up ahead. Or they could make future surveillance systems overlook a gun by seeing it as a toy doll.
An independent AI research group run by MIT students has demonstrated a new way to fool the computer vision algorithms that enable AI systems to see the world. The approach could prove up to 1,000 times as fast as existing ways of hacking “black box” systems whose inner workings remain hidden to outsiders. That idea of a black box perfectly describes the neural networks behind the deep learning algorithms enabling computer vision services for Google, Facebook, and other companies. In fact, the MIT team showed how its attack algorithm could readily trick Google’s service into misclassifying dogs and all sorts of objects.
In a new paper, the group, called LabSix, describes how it exploited the Google Cloud Vision API, which is publicly available to developers who want their programs to perform “image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.” But the LabSix group notes that any computer vision service that relies upon deep learning, such as Amazon Rekognition or Clarifai’s image classification, could be vulnerable to its approach.
“It’s one thing to show that an algorithm works in controlled scientific settings, and another thing to show that it works against real-world systems,” says Anish Athalye, a PhD candidate in computer science at MIT. “Based on our experiments, we were fairly confident that we should be able to break Google Cloud Vision, but it’s always cool to see it actually work in practice.”
The black box nature of deep learning algorithms, currently the most popular form of machine learning, makes it especially tough for would-be hackers to figure out how to exploit them. Deep learning algorithms typically analyze the patterns of pixels found in digital images to classify the overall pictures under object categories such as “dog” or “cat.” Hackers who want to trick such computer vision services often try out test images and tweak the images slightly each time to figure out how to fool the deep learning algorithms.
The crudest hacking approach might involve changing individual pixels within images and seeing how that changes the way the algorithms classify those images. That approach is “inefficient and impractical” when tackling large images with tens of thousands of pixels, says Andrew Ilyas, an undergraduate researcher in computer science at MIT. Instead, the LabSix group adapted a “natural evolution strategies” method that was first proposed by researchers in Switzerland and Germany.
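To see why the pixel-by-pixel approach scales so badly, consider a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from the LabSix paper: `toy_score` is a hypothetical stand-in for a black-box classifier's score for one label, and `per_pixel_gradient` is a made-up helper that nudges one pixel value per query and counts how many queries that takes.

```python
def toy_score(image):
    """Hypothetical black-box score for one label: a weighted sum of pixels."""
    return sum((i % 7 - 3) * p for i, p in enumerate(image))


def per_pixel_gradient(score_fn, image, step=1e-3):
    """Estimate how the score responds to each pixel, one query per pixel."""
    base = score_fn(image)            # one query for the unmodified image
    queries = 1
    grad = []
    for i in range(len(image)):
        nudged = list(image)
        nudged[i] += step             # change a single pixel value
        grad.append((score_fn(nudged) - base) / step)
        queries += 1                  # every pixel value costs another query
    return grad, queries


# Even a tiny 16x16 RGB image has 768 pixel values to probe.
image = [0.5] * (16 * 16 * 3)
grad, queries = per_pixel_gradient(toy_score, image)
print(queries)
```

The query count grows in lockstep with the number of pixel values, so an image with tens of thousands of pixels would need tens of thousands of queries for every single estimate.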
“You can think of our evolutionary strategies approach as generating smaller populations of images around an image, where large random groups of pixels are perturbed instead of single pixels,” Ilyas says. “Then, some neat math will show that we can, given the classifier’s output on these randomly perturbed images, recover what the contribution of each individual pixel is to the classification output.”
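The idea Ilyas describes can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the LabSix implementation: `toy_score` is a hypothetical stand-in for a black-box classifier's score, `nes_gradient` and its parameters (`population`, `sigma`, `seed`) are invented names, and the choice of Gaussian noise applied in plus/minus pairs is an assumption about how the random perturbations might be drawn.

```python
import random


def toy_score(image):
    """Hypothetical black-box score for one label: a weighted sum of pixels."""
    return sum((i % 7 - 3) * p for i, p in enumerate(image))


def nes_gradient(score_fn, image, population=50, sigma=0.01, seed=0):
    """Estimate each pixel's contribution from `population` queries.

    Every query perturbs all pixels at once with random noise. Noise
    vectors are used in +/- pairs, and each pixel's estimate is the
    noise it received weighted by how the score changed.
    """
    rng = random.Random(seed)
    n = len(image)
    pairs = population // 2
    grad = [0.0] * n
    for _ in range(pairs):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        plus = score_fn([p + sigma * d for p, d in zip(image, noise)])
        minus = score_fn([p - sigma * d for p, d in zip(image, noise)])
        for i, d in enumerate(noise):
            grad[i] += (plus - minus) * d
    return [g / (2 * sigma * pairs) for g in grad]


image = [0.5] * (16 * 16 * 3)
grad = nes_gradient(toy_score, image, population=50)

# Nudge the image a small step along the estimated direction.
stepped = [p + 0.001 * g for p, g in zip(image, grad)]
print(toy_score(stepped) > toy_score(image))
```

Here 50 queries produce a noisy estimate for all 768 pixel values at once. The estimate is far from exact, but it points in a direction that raises the score, which is all an iterative attack needs at each step.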
Once they figured out what makes each deep learning algorithm tick, the researchers could create “adversarial images” that would trick a computer vision service into seeing a cat as an airplane or a deer as a truck. They produced such adversarial images by first manipulating individual pixels to skew the classification toward a given category such as “dog.” Then they adjusted the color values of the overall image to make it appear more like a completely separate image category such as “skier.”
The LabSix group’s method is the “first attack algorithm for the ‘partial-information’ setting,” says Jessy Lin, an undergraduate researcher in computer science at MIT. Google Cloud Vision is one example of a partial-information setting because each image query returns only 10 image classes, with scores that are uninterpretable to anyone but Google’s algorithm developers. (By comparison, more transparent algorithms might at least assign a percentage score that indicates the probability of a given image being a cat or a dog.)
“Our partial-information black-box attack can generate an adversarial image without knowing the probabilities for all the classes—in fact, without even knowing all the classes or having probabilities at all,” Lin says. “Our method makes real-world attacks possible where they would previously be intractable or very costly.”
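The constraint Lin describes can be made concrete with a small Python sketch. It models only the setting, not the attack itself, and none of it reflects Google's actual API: `top_k_response` and `target_score` are hypothetical helpers, and the labels and scores are made up.

```python
def top_k_response(all_scores, k=10):
    """Simulate a service that reveals only its k highest-scoring labels."""
    ranked = sorted(all_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def target_score(response, target):
    """What an attacker can observe: the target's score if listed, else None."""
    for label, score in response:
        if label == target:
            return score
    return None


# Made-up scores for illustration; the attacker never sees this full table.
hidden_scores = {"dog": 0.90, "skier": 0.05, "cat": 0.03, "truck": 0.02}
response = top_k_response(hidden_scores, k=2)

print(target_score(response, "skier"))   # listed, so a score is visible
print(target_score(response, "truck"))   # not listed, so nothing to go on
```

An attacker in this setting gets no feedback at all about labels that fall outside the visible list, so an attack has to work from whatever the returned labels and scores reveal.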
The LabSix researchers submitted their paper for consideration at the upcoming Computer Vision and Pattern Recognition conference (CVPR), taking place in Utah from 18 to 22 June 2018. They don’t plan to commercialize their findings but want to continue demonstrating possible exploits for deep learning systems so that security researchers can patch the holes.
“There haven’t been notable incidents yet, but our attack demonstrates that it’s very much a real concern,” says Logan Engstrom, an undergraduate researcher in computer science at MIT. “Large-scale commercial machine learning systems are being used in real systems, and this kind of attack can be directly applied to make them misbehave.”