Google Is Teaching Machines To 'See' Pictures And Describe Them With Startling Accuracy

JILLIAN D'ONFRO

Nov 19, 2014, 00:34 IST

Google wants to be able to create accurate, automatic captions for complex photos, and, according to a recent blog post, it's getting closer to achieving that goal.

Google's machine-learning system can "see" a photo and automatically produce descriptive and relevant captions. The system has to be able to get a deeper representation of what's going on in a picture by recognizing how different objects in the picture relate to one another, an then translate that into natural-sounding description.

For example, the system automatically captioned this photo, "Two pizzas sitting on top of a stove top oven":

Complimentary Tech Event

Transform talent with learning that works

Capability development is critical for businesses who want to push the envelope of innovation.Discover how business leaders are strategizing around building talent capabilities and empowering employee transformation.Know More

Google

"This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images," Google research scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan write.

The key innovation is that Google's team has merged a computer vision system, which classifies objects in images, and a natural language processing model into one system, where it can take an image and directly produce a sentence to describe it.

In Google's words, the model "combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption":