My Initial Impressions of Google’s Cloud Vision API

When Google released the Cloud Vision API last month I started to think of the interesting ways that artificially intelligent image recognition could be used.  One idea that jumped to mind was enhancing the ability to search for marketing images and stock photography in a digital asset management system.
As much as companies encourage individuals to tag and annotate images accurately, invariably the metadata is poor or missing altogether.  What if we could use artificial intelligence to add metadata to images automatically at the time they are uploaded to the system?  Could this new metadata make it easier or faster to find images?
Google’s Cloud Vision API and IBM Watson’s AlchemyVision API both provide this service.  And Adobe is incorporating this feature natively into its AEM DAM, calling it Smart Tags.  I have tested the Google and IBM services and initially found that the Google Cloud Vision API returned more metadata for each image and was slightly more precise with the labels it suggested.  I will continue to test these services as they advance, but in this post I am going to focus on my impressions of the Google Cloud Vision API.

Labels are Good

I was quite impressed with the labels that Google’s Cloud Vision API returned.  It frequently found informative labels that matched something important in the image.  For the purpose of finding images, I would say the labels were definitely “better than nothing.”  Not perfect, and not necessarily exhaustive (more on this later), but if I had to choose these labels vs. no metadata at all, they would be helpful.
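For anyone curious what these calls look like under the hood, here is a rough sketch of a label request against the Vision REST endpoint using only the Python standard library.  The helper names and the API-key placeholder are mine, not Google’s – treat it as an outline, not a finished client:

```python
import base64
import json
import urllib.request

VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for a LABEL_DETECTION request."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
        }]
    }


def annotate(image_bytes, api_key):
    """POST the request to the Cloud Vision endpoint and return parsed JSON."""
    body = json.dumps(build_label_request(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        VISION_URL + "?key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Each entry in the returned labelAnnotations list carries a description and a score, which is where the confidence numbers below come from.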
Here are a few examples of the useful labels returned by the API.  The numbers next to each label are a confidence score – the closer to 1.0, the higher Google’s confidence that the label is correct.  Click on the images to see full-sized versions:

The labels appear to be accurate and descriptive.  Nothing jumps out as incorrect.  In the second image, I was particularly impressed that it correctly identified the game-birds as ‘hunting decoys’.

In some images I tested, the labels were not as good.  If the image was dark, dim, or blurry, the API sometimes returned nonsensical labels.  An image of a silhouetted hunter at dusk carrying a deceased turkey on his back returned “tyrannosaurus”.  It is a creative guess, but not very helpful.  It did have a low confidence score (0.58), so I might recommend ignoring labels below a certain threshold.  I cannot say for sure how low is too low, but around 0.55 the labels seem to be frequently wrong or less useful.  Again, for the purpose of a digital asset search, the benefit of the good labels probably outweighs the cost of the bad ones.
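Applying that kind of cutoff is a one-liner.  The exact threshold is a judgment call – as noted, even “tyrannosaurus” scored 0.58 – so the value here is only illustrative:

```python
def filter_labels(annotations, threshold=0.55):
    """Keep only label descriptions whose confidence clears the threshold.

    `annotations` is a list of dicts shaped like the API's labelAnnotations,
    e.g. {"description": "hunting decoy", "score": 0.92}.
    """
    return [a["description"] for a in annotations if a.get("score", 0.0) >= threshold]
```

Raising the threshold to 0.60 would drop the tyrannosaurus while keeping the decoys; tuning it against your own image library is probably the only way to find the sweet spot.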

OCR is Amazing

If the labels were good, the optical character recognition (OCR) is A-M-A-Z-I-N-G.  I have been blown away by the text extraction, even at odd angles, in unusual fonts, and against complex backgrounds.  Here are some examples:

The text extraction in the Times Square image is astounding, from the tiny Prudential billboard at the top to the RENT billboard and the HSBC slogan “The world’s local bank”.  In the picture of the keyboard, it found the word ‘Backspace’ underneath and partially obscured by another object.  Yes, there are a few errors, but 9 times out of 10, Google nails the text extraction.  I often had to zoom in and hunt for the text that Google found because it was too small for me to see at first – à la Where’s Waldo.
One shortcoming I found in text detection was with foreign languages.  A McDonald’s sign in Russian produced: MaKAOHanAC.  It was close, but it did not pick up the Cyrillic characters.  Another example produced text like: B bl CT POT O.  My guess is that the API passes text through a dictionary that helps it tokenize words more accurately.  That does not appear to be enabled for non-English languages.
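If you are working with the raw REST response, the extracted text and the detected language are easy to pull out: the first entry in textAnnotations holds the full text block, and its locale field reports the language the API believes it found.  A minimal sketch (helper name is my own):

```python
def extract_text(annotate_response):
    """Return (full_text, locale) from one image's annotate response dict.

    The first entry in textAnnotations is the full extracted text;
    its "locale" field carries the detected language (e.g. "en", "ru").
    Subsequent entries are the individual words/phrases.
    """
    annotations = annotate_response.get("textAnnotations", [])
    if not annotations:
        return "", None
    first = annotations[0]
    return first.get("description", ""), first.get("locale")
```

Checking the locale could also be a cheap way to flag images where the dictionary-assisted tokenization may not have kicked in.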

The Center of Attention

One interesting downside I have noticed is that the Cloud Vision API tends to label the dominant subject of an image, but does not necessarily label everything it finds in the image.  Once it finds something it recognizes, it seems to fixate on that, even though it could probably recognize other things in the image if they were presented in isolation.  Details that a marketer or designer might think are important in an image, like the presence of a person, are apparently deemed inconsequential.  Here are a few examples of this problem:
Notice in the first image, it found and described the apple pie, while failing to mention the non-pied apples, the baseball, or the bat.  In the second image, it does not mention the American flags along the bottom of the building, which might have been important.  The third image with a dog and a cat demonstrates the problem as well – it found the dog but ignored the cat.  This problem occurred in most images, and is probably difficult to solve.  Where do you draw the line?  Do you mention the grass underneath the game-birds?  Or the hat on a man in the background?
One workaround I have come up with is to divide the images into quadrants, for example, and submit each piece for separate analysis, hoping it might yield more labels than just the image as a whole.  I tried this with the apple pie image above – breaking it up into four pieces – and it did yield some additional labels, including ‘baseball equipment’, ‘apple’, and ‘fruit’.  It is more expensive to do this, but it would probably yield better results.  I guess it depends on your needs.  Is it more important to identify the primary subject of the image, or to find as many different things as possible?  Maybe Google will offer this as a configurable setting in the future.
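The quadrant workaround is straightforward to sketch.  Something like the following (helper names are mine) computes PIL-style crop boxes for the four pieces and merges the resulting label sets, deduplicating along the way:

```python
def quadrant_boxes(width, height):
    """Split an image's dimensions into four (left, upper, right, lower)
    crop boxes, suitable for passing to PIL's Image.crop()."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]


def merge_labels(label_lists):
    """Union the label descriptions from the whole image and each quadrant,
    preserving first-seen order."""
    seen = []
    for labels in label_lists:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen
```

Each crop would then go through the same annotate call as the full image – five requests instead of one, which is where the extra cost comes from.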

What Logo?

Logo detection was tough.  It definitely worked better on clip-art-style drawings of logos than on photographs.  It almost never found logos buried inside complex images, like on a building or sign.  Of the previous images, the Times Square photo produced no logos at all, but the Coca-Cola painting on the brick wall did yield the ‘Coca Cola’ logo.
It also had trouble with logo variations – sometimes matching one form of a logo but not another.  Here are a couple of examples of the McDonald’s logo in various drawings and images.  It matched some, and missed others:
Zeroing in with Landmarks

Landmark detection provided an interesting addition to label and text detection.  It was often markedly more precise than label detection alone, which could be the difference between finding an image or not.  Here are a few examples (showing the detected labels and landmarks, for comparison):

Union Station in Denver and St. Thomas Church in Leipzig were accurately detected, and the landmarks provided a more precise description than the labels alone.  It was interesting, and maybe lucky, that the streetcar picture returned ‘French Quarter Gifts’.  French Quarter (as in New Orleans) would be a good addition to the picture, but I suspect the API actually used the name of a business near where the picture was taken (there are a lot of gift shops along Canal St.).  Had it picked another business, the keywords might not have been as helpful.  I am not sure whether lesser-known businesses would appear, though, as I have not seen any others so far.
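One nice property of the API is that a single annotateImage request can carry several feature types, so this kind of label-versus-landmark (or logo) comparison only costs one round trip.  A quick sketch, with my own helper name and an already base64-encoded image assumed:

```python
def build_multi_feature_request(image_b64, max_results=10):
    """Ask for labels, logos, and landmarks in one annotateImage request,
    so the results can be compared side by side."""
    features = ["LABEL_DETECTION", "LOGO_DETECTION", "LANDMARK_DETECTION"]
    return {
        "requests": [{
            "image": {"content": image_b64},
            "features": [{"type": f, "maxResults": max_results} for f in features],
        }]
    }
```

The response then carries separate labelAnnotations, logoAnnotations, and landmarkAnnotations lists for the same image, which is how the comparisons above were assembled.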

Closing Remarks

In general, I remain very pleased with the power and flexibility of the Cloud Vision API.  At this point, I think it is approaching “useful” with respect to finding untagged images in a digital asset repository.  It’s not perfect, but I suspect it will get better over time.
I also saw today that Facebook is incorporating similar technology to automatically describe photographs to people who are visually impaired.  The new feature, called Automatic Alternative (Alt) Text, can provide a descriptive sentence about a photograph, such as “Image may contain: two people, smiling, sunglasses, sky, tree, outdoor.”  It is another example of this useful technology in action.

About the Author

Chad is a Principal of Search and Knowledge Discovery at Perficient. He was previously the Director of Perficient's national Google for Work practice.
