7 July 2009—Last week, researchers from the University of Maryland and the University of Pennsylvania reported the workings of a computer program capable of learning to understand video footage and describe it in words.
A sort of video-to-text system, the software reports the action rather than the dialogue. So an analysis of a video showing Hank Aaron hitting his 715th career home run to surpass the mark set by Babe Ruth wouldn’t include the play-by-play announcer saying, “It’s 715! There’s a new home run champion of all time!” What the software offers to differentiate the moment from Aaron’s previous 714 homers is the description of two fans running alongside the new home run king as he trots around the bases.
The researchers described the system at the IEEE Computer Society’s Conference on Computer Vision and Pattern Recognition, in Miami Beach. What makes it unique is its ability to draw links among human actions and to understand causal relationships. To illustrate how the system works, the researchers showed how it analyzed footage of Major League Baseball games. During a learning period, the system watched games that had already been tagged with human-generated captions describing who the players were and what they were doing (pitcher: pitch; batter: no swing; batter: swing-miss; batter: swing-hit-run; fielder: run-catch-throw).
Like the average human fan, the system learned the mechanics of the game by watching. Using the narratives, it created a set of hypotheses about the relationships among the actions and measured the correctness of each against the training videos. For example, it learned that hitting the ball is causally dependent on pitching the ball, since it isn’t possible to hit a ball unless it has been pitched. The system stores the actions and relationships from numerous plays, across several videos, in a single database.
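The idea of proposing dependency hypotheses and scoring them against captioned plays can be illustrated with a small Python sketch. This is not the researchers’ method: it swaps in a much simpler stand-in that scores a hypothesis “effect depends on cause” by how often the cause came first in the plays where the effect occurred. The sample plays and the names `TRAINING_PLAYS` and `score_dependencies` are assumptions made for illustration; only the action labels come from the captions quoted above.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical captioned plays, using the action labels quoted in the article.
TRAINING_PLAYS = [
    ["pitcher: pitch", "batter: no swing"],
    ["pitcher: pitch", "batter: swing-miss"],
    ["pitcher: pitch", "batter: swing-hit-run", "fielder: run-catch-throw"],
    ["pitcher: pitch", "batter: swing-hit-run", "fielder: run-catch-throw"],
]


def score_dependencies(plays):
    """Score each hypothesis (cause, effect) as the fraction of plays
    containing `effect` in which `cause` was seen earlier."""
    support = defaultdict(int)      # (cause, effect) -> plays where cause came first
    occurrences = defaultdict(int)  # effect -> number of plays containing it
    for play in plays:
        first_seen = {}
        for index, action in enumerate(play):
            first_seen.setdefault(action, index)
        for effect in first_seen:
            occurrences[effect] += 1
        for cause, effect in permutations(first_seen, 2):
            if first_seen[cause] < first_seen[effect]:
                support[(cause, effect)] += 1
    return {
        (cause, effect): support[(cause, effect)] / occurrences[effect]
        for cause, effect in permutations(occurrences, 2)
    }


scores = score_dependencies(TRAINING_PLAYS)
```

With this toy data, the hypothesis that a hit depends on a pitch gets a score of 1.0 and the reverse hypothesis gets 0.0. Always coming first is only evidence for a causal link, not proof of one, so a real system would need a stronger test than this.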
Having learned from someone else’s play-by-play descriptions of the actions on the field, the system was able to recognize the elements of baseball plays on unlabeled videos and to improve its own accuracy by comparing what players did in new videos to similar sequences of events it had seen in the training videos.
The software essentially produces a flowchart full of AND and OR junctions that account for all the possibilities, or story lines, that it has experienced. Abhinav Gupta, a doctoral candidate at the University of Maryland who was a member of the research team, admits that it is quite likely the system would foul up a description of a play that had not previously appeared on videos it had analyzed. But he notes that such an instance would immediately push the system back into learning mode. It would add that new wrinkle—say, a wild pitch or a balk—to its flowchart and instantly recognize it the next time it happened.
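A flowchart of AND and OR junctions can be pictured as a small tree in which an AND node means “all of these, in order” and an OR node means “one of these.” The Python sketch below is only an illustration under that assumption, not the researchers’ data structure. The class names, the `explain` helper, and the labels `pitcher: wild-pitch` and `runner: advance` are hypothetical; the other labels come from the captions quoted earlier.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Leaf:
    action: str

    def storylines(self):
        return [(self.action,)]


@dataclass
class And:
    children: list

    def storylines(self):
        # One storyline from every child, concatenated in order.
        combos = product(*(child.storylines() for child in self.children))
        return [sum(combo, ()) for combo in combos]


@dataclass
class Or:
    children: list

    def storylines(self):
        # Any storyline from any one child.
        return [line for child in self.children for line in child.storylines()]


def explain(graph, observed):
    """Return 'recognized' if the observed play matches a known storyline;
    otherwise add it as a new alternative and return 'learned'."""
    observed = tuple(observed)
    if observed in graph.storylines():
        return "recognized"
    graph.children.append(And([Leaf(action) for action in observed]))
    return "learned"


play = Or([
    And([
        Leaf("pitcher: pitch"),
        Or([
            Leaf("batter: no swing"),
            Leaf("batter: swing-miss"),
            And([Leaf("batter: swing-hit-run"), Leaf("fielder: run-catch-throw")]),
        ]),
    ]),
])

new_play = ["pitcher: wild-pitch", "runner: advance"]  # hypothetical labels
first = explain(play, new_play)
second = explain(play, new_play)
```

In this sketch the graph starts with three story lines. The unfamiliar play is added as a fourth the first time it is seen and is matched on the second pass, which mirrors the return to learning mode that Gupta describes.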
To account for the obvious fact that at certain junctures several things are happening at once (for example, the batter running to first base, the first baseman running to get in the path of the batted ball so he can catch and then throw it, and the pitcher running to first base to await the throw from the first baseman), the software creates records of the players’ movements on separate tracks. A good analogy is the way music engineers record lead vocals on one track, background vocals on another, drums on a third, and so on.
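The separate-tracks idea can be sketched as one list of timed segments per player, much like the channels of a multitrack recording. The Python below is a minimal illustration, not the system’s actual representation. The `Segment` type, the `active_at` helper, and all of the timings are invented for the example.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the start of the play (made-up values)
    end: float
    action: str


# One track per player, as in the first-base example above.
tracks = {
    "batter": [
        Segment(0.5, 1.0, "swing-hit"),
        Segment(1.0, 5.0, "run toward first base"),
    ],
    "first baseman": [
        Segment(1.2, 3.0, "run toward ball"),
        Segment(3.0, 4.0, "catch-throw"),
    ],
    "pitcher": [
        Segment(0.0, 0.5, "pitch"),
        Segment(1.5, 4.5, "run toward first base"),
    ],
}


def active_at(tracks, t):
    """Return what each player is doing at time t (players who are idle are omitted)."""
    return {
        player: segment.action
        for player, segments in tracks.items()
        for segment in segments
        if segment.start <= t < segment.end
    }


simultaneous = active_at(tracks, 2.0)
```

Keeping each player on a separate track means overlapping actions never have to be forced into a single sequence. A query at one instant simply reads across all the tracks.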
Once the system settles on the most likely story line for a play, it uses that as the basis for a linguistic description of the actions in the video. Gupta notes that the system’s phrasings sound like those of Frankenstein’s monster. It uses only nouns (batter), verbs (run), and prepositions (toward first base).
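A description built only from nouns, verbs, and prepositional phrases can be imitated with a simple template. The Python sketch below shows why the output sounds so terse. It is an assumption about the general shape of such a generator, not the researchers’ code, and the `describe` function and the sample story line are hypothetical.

```python
def describe(storyline):
    """Turn (noun, verb, prepositional phrase) triples into terse sentences.

    An empty prepositional phrase is simply left out.
    """
    sentences = []
    for noun, verb, phrase in storyline:
        words = [noun, verb] + ([phrase] if phrase else [])
        sentences.append(" ".join(words).capitalize() + ".")
    return " ".join(sentences)


# Hypothetical most-likely story line for one play.
storyline = [
    ("pitcher", "pitches", ""),
    ("batter", "hits", ""),
    ("batter", "runs", "toward first base"),
]

text = describe(storyline)
# text == "Pitcher pitches. Batter hits. Batter runs toward first base."
```

With no articles, adjectives, or connecting words, the output reads as a string of clipped statements.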
There are several immediately evident applications for generating descriptions of videos automatically. The operators of YouTube, who currently rely on producers to describe the content of the tens of thousands of videos uploaded to the site each day, would be better able to categorize and call up videos based on search entries.
Gupta says the system would also be useful for video surveillance of places like airports and traffic intersections. Incidents could be tagged as “unusual/suspicious” for later review by law enforcement or brought to the attention of security personnel, all without relying on humans to keep their eyes on video screens and remain attentive.