“I had to double check I wasn’t playing the wrong audio file.”
The first time Abe Davis coaxed intelligible speech from a silent video of a bag of crab chips (an impassioned recitation of “Mary Had a Little Lamb”), he could hardly believe it was possible. Davis is a Ph.D. candidate at MIT, and his group’s image-processing algorithm can turn everyday objects into visual microphones, deciphering the tiny vibrations they undergo as captured on video.
The research, which will be presented at the computer graphics conference SIGGRAPH 2014 next week, builds on earlier work from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) on capturing motion in video far smaller than a single pixel. By tracking how the color of pixels along an object’s edges fluctuates, the group’s algorithm can measure the object’s minuscule movements, and even magnify them: earlier demonstrations amplified a wine glass’s oscillations as a tone played and visually revealed a heartbeat beneath the skin.
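The paper’s actual pipeline isn’t spelled out here, but the core intuition is simple enough to simulate. The toy Python sketch below is not the CSAIL algorithm (the frame rate, blur, and edge profile are all invented for illustration); it only shows why the brightness of a single edge pixel encodes sub-pixel motion: shifting a blurred edge by a hundredth of a pixel nudges the brightness a fixed pixel records, and that fluctuation, traced over time, is the sound.

```python
# Toy illustration, not the CSAIL method: a blurred edge vibrates by a
# fraction of a pixel, and the brightness one fixed pixel sees tracks
# that vibration almost linearly.
import numpy as np

def edge_intensity(shift, blur=2.0):
    # Logistic edge profile: the brightness a fixed camera pixel records
    # when a blurred edge is displaced by `shift` pixels.
    return 1.0 / (1.0 + np.exp(-shift / blur))

# Hypothetical setup: a 440 Hz tone shakes the object by at most
# 1/100 of a pixel, filmed at 2,200 frames per second for one second.
fps, tone_hz, n_frames = 2200, 440.0, 2200
t = np.arange(n_frames) / fps
displacement = 0.01 * np.sin(2 * np.pi * tone_hz * t)

# The "video": one border pixel's brightness in every frame.
pixel = edge_intensity(displacement)

# Subtracting the mean and rescaling recovers a signal proportional
# to the tone -- the intensity fluctuation is the sound.
recovered = pixel - pixel.mean()
recovered /= np.abs(recovered).max()

corr = np.corrcoef(recovered, displacement)[0, 1]
print(f"correlation with the source tone: {corr:.4f}")  # ~1.0000
```

A single pixel’s signal would be hopelessly noisy in a real camera; the actual system pools motion signals like this from many points across the frame, which is what pushes its sensitivity so far below a single pixel.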
“It was clear for us quickly that there’s a strong relation between sound and visual motion,” says Michael Rubinstein, a postdoc at Microsoft Research who worked on this and the earlier CSAIL research. “We had this crazy idea: can we actually use videos to recover sound?”
The first speech recovered from the chip bag can be played below. (Later recordings were much clearer, but probably less funny.)