My Initial Impressions of Google’s Cloud Vision API

When Google released the Cloud Vision API last month I started to think of the interesting ways that artificially intelligent image recognition could be used.  One idea that jumped to mind was enhancing the ability to search for marketing images and stock photography in a digital asset management system.
As much as companies encourage individuals to tag and annotate images accurately, invariably the metadata is poor or missing altogether.  What if we could use artificial intelligence to add metadata to images automatically at the time they are uploaded to the system?  Could this new metadata make it easier or faster to find images?
Google’s Cloud Vision API and IBM Watson’s AlchemyVision API both provide this service.  And Adobe is incorporating this feature natively into its AEM DAM, calling it Smart Tags.  I have tested the Google and IBM services and initially found that the Google Cloud Vision API returned more metadata for each image and was slightly more precise with the labels it suggested.  I will continue to test these services as they advance, but in this post I am going to focus on my impressions of the Google Cloud Vision API.

Labels are Good

I was quite impressed with the labels that Google’s Cloud Vision API returned.  It frequently found informative labels that matched something important in the image.  For the purpose of finding images, I would say the labels were definitely “better than nothing.”  Not perfect, and not necessarily exhaustive (more on this later), but if I had to choose these labels vs. no metadata at all, they would be helpful.
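For anyone curious what these calls look like under the hood, here is a rough sketch of a label request against the Vision REST endpoint using only the Python standard library.  The helper names and the API-key placeholder are mine, not Google’s – treat it as an outline, not a finished client:

```python
import base64
import json
import urllib.request

VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for a LABEL_DETECTION request."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
        }]
    }


def annotate(image_bytes, api_key):
    """POST the request to the Cloud Vision endpoint and return parsed JSON."""
    body = json.dumps(build_label_request(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        VISION_URL + "?key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Each entry in the returned labelAnnotations list carries a description and a score, which is where the confidence numbers below come from.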
Here are a few examples of the useful labels returned by the API.  The numbers next to each label are a confidence score – the closer to 1.0, the higher Google’s confidence that the label is correct.  Click on the images to see full-sized versions:

The labels appear to be accurate and descriptive.  Nothing jumps out as incorrect.  In the second image, I was particularly impressed that it correctly identified the game-birds as ‘hunting decoys’.

In some images I tested, the labels were not as good.  If the image was dark, dim, or blurry, the API sometimes returned nonsensical labels.  An image of a silhouetted hunter at dusk carrying a deceased turkey on his back returned “tyrannosaurus”.  It is a creative guess, but not very helpful.  It did have a low confidence score (0.58), so I might recommend ignoring labels below a certain threshold.  I cannot say for sure how low is too low, but around 0.55 the labels seem to be frequently wrong or less useful.  Again, for the purpose of a digital asset search, the benefit of the good labels probably outweighs the cost of the bad ones.
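Applying that kind of cutoff is a one-liner.  The exact threshold is a judgment call – as noted, even “tyrannosaurus” scored 0.58 – so the value here is only illustrative:

```python
def filter_labels(annotations, threshold=0.55):
    """Keep only label descriptions whose confidence clears the threshold.

    `annotations` is a list of dicts shaped like the API's labelAnnotations,
    e.g. {"description": "hunting decoy", "score": 0.92}.
    """
    return [a["description"] for a in annotations if a.get("score", 0.0) >= threshold]
```

Raising the threshold to 0.60 would drop the tyrannosaurus while keeping the decoys; tuning it against your own image library is probably the only way to find the sweet spot.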

OCR is Amazing

If the labels were good, the optical character recognition (OCR) is A-M-A-Z-I-N-G.  I have been blown away by the text extraction, even at odd angles, in unusual fonts, and against complex backgrounds.  Here are some examples:

The text extraction in the Times Square image is astounding, from the tiny Prudential billboard at the top to the RENT billboard and the HSBC slogan “The world’s local bank”.  In the picture of the keyboard, it found the word ‘Backspace’ underneath and partially obscured by another object.  Yes, there are a few errors, but 9 times out of 10, Google nails the text extraction.  I often had to zoom in and hunt for the text that Google found because it was too small for me to see at first – à la Where’s Waldo.
One shortcoming I found in text detection was with foreign languages.  A McDonald’s sign in Russian produced: MaKAOHanAC.  It was close, but it did not pick up the Cyrillic characters.  Another example produced text like: B bl CT POT O.  My guess is that the API passes text through a dictionary that helps it tokenize words more accurately.  That does not appear to be enabled for non-English languages.
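If you are working with the raw REST response, the extracted text and the detected language are easy to pull out: the first entry in textAnnotations holds the full text block, and its locale field reports the language the API believes it found.  A minimal sketch (helper name is my own):

```python
def extract_text(annotate_response):
    """Return (full_text, locale) from one image's annotate response dict.

    The first entry in textAnnotations is the full extracted text;
    its "locale" field carries the detected language (e.g. "en", "ru").
    Subsequent entries are the individual words/phrases.
    """
    annotations = annotate_response.get("textAnnotations", [])
    if not annotations:
        return "", None
    first = annotations[0]
    return first.get("description", ""), first.get("locale")
```

Checking the locale could also be a cheap way to flag images where the dictionary-assisted tokenization may not have kicked in.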

The Center of Attention

One interesting downside I have noticed is that the Cloud Vision API tends to label the dominant subject of an image, but does not necessarily label everything it finds in the image.  Once it finds something it recognizes, it seems to fixate on that, even though it could probably recognize other things in the image if they were presented in isolation.  Details that a marketer or designer might think are important in an image, like the presence of a person, are apparently deemed inconsequential.  Here are a few examples of this problem:
Notice in the first image, it found and described the apple pie, while failing to mention the non-pied apples, the baseball, or the bat.  In the second image, it does not mention the American flags along the bottom of the building, which might have been important.  The third image with a dog and a cat demonstrates the problem as well – it found the dog but ignored the cat.  This problem occurred in most images, and is probably difficult to solve.  Where do you draw the line?  Do you mention the grass underneath the game-birds?  Or the hat on a man in the background?
One workaround I have come up with is to divide the images into quadrants, for example, and submit each piece for separate analysis, hoping it might yield more labels than just the image as a whole.  I tried this with the apple pie image above – breaking it up into four pieces – and it did yield some additional labels, including ‘baseball equipment’, ‘apple’, and ‘fruit’.  It is more expensive to do this, but it would probably yield better results.  I guess it depends on your needs.  Is it more important to identify the primary subject of the image, or to find as many different things as possible?  Maybe Google will offer this as a configurable setting in the future.
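The quadrant workaround is straightforward to sketch.  Something like the following (helper names are mine) computes PIL-style crop boxes for the four pieces and merges the resulting label sets, deduplicating along the way:

```python
def quadrant_boxes(width, height):
    """Split an image's dimensions into four (left, upper, right, lower)
    crop boxes, suitable for passing to PIL's Image.crop()."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]


def merge_labels(label_lists):
    """Union the label descriptions from the whole image and each quadrant,
    preserving first-seen order."""
    seen = []
    for labels in label_lists:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen
```

Each crop would then go through the same annotate call as the full image – five requests instead of one, which is where the extra cost comes from.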

What Logo?

Logo detection was tough.  It definitely worked better on clip-art-style drawings of logos than on photographs.  It almost never found logos buried inside complex images, like on a building or sign.  Of the previous images, the Times Square photo produced no logos at all, but the Coca-Cola painting on the brick wall did yield the ‘Coca Cola’ logo.
It also had trouble with logo variations – sometimes matching one form of a logo but not another.  Here are a couple of examples of the McDonald’s logo in various drawings and images.  It matched some, and missed others:
Zeroing in with Landmarks

Landmark detection provided an interesting addition to label and text detection.  It was often markedly more precise than label detection alone, which could be the difference between finding an image or not.  Here are a few examples (showing the detected labels and landmarks, for comparison):

Union Station in Denver and St. Thomas Church in Leipzig were accurately detected, and the landmarks provided a more precise description than the labels alone.  It was interesting, and maybe lucky, that the streetcar picture returned ‘French Quarter Gifts’.  French Quarter (as in New Orleans) would be a good addition to the picture, but I suspect the API actually used the name of a business near where the picture was taken (there are a lot of gift shops along Canal St.).  Had it picked another business, the keywords might not have been as helpful.  I am not sure whether lesser-known businesses would appear, though, as I have not seen any others so far.
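One nice property of the API is that a single annotateImage request can carry several feature types, so this kind of label-versus-landmark (or logo) comparison only costs one round trip.  A quick sketch, with my own helper name and an already base64-encoded image assumed:

```python
def build_multi_feature_request(image_b64, max_results=10):
    """Ask for labels, logos, and landmarks in one annotateImage request,
    so the results can be compared side by side."""
    features = ["LABEL_DETECTION", "LOGO_DETECTION", "LANDMARK_DETECTION"]
    return {
        "requests": [{
            "image": {"content": image_b64},
            "features": [{"type": f, "maxResults": max_results} for f in features],
        }]
    }
```

The response then carries separate labelAnnotations, logoAnnotations, and landmarkAnnotations lists for the same image, which is how the comparisons above were assembled.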

Closing Remarks

In general, I remain very pleased with the power and flexibility of the Cloud Vision API.  At this point, I think it is approaching “useful” with respect to finding untagged images in a digital asset repository.  It’s not perfect, but I suspect it will get better over time.
I also saw today that Facebook is incorporating similar technology to automatically describe photographs to people who are visually impaired.  The new feature, called Automatic Alternative (Alt) Text, can provide a descriptive sentence about a photograph, such as “Image may contain: two people, smiling, sunglasses, sky, tree, outdoor.”  It is another example of this useful technology in action.

About the Author

Chad is a Principal of Search and Knowledge Discovery at Perficient. He was previously the Director of Perficient's national Google for Work practice.
