
Reverse engineering the GSA Query Suggestions feature


Google often cites 1.7 words as the average length of a query submitted to its website.  In other words, a typical query contains only one or two words.  Google can be magical at times, but using one or two words to search through trillions of web pages seems at best hit or miss.  In practice, longer and more specific queries tend to produce better results.
Query Suggestions to the Rescue
How do we encourage users to use more than 1.7 words in their query?  The Query Suggestions feature is one popular and effective solution.  It presents a list of suggested queries below the search box, so users only have to type a few letters before seeing a list of queries to pick from.
But the suggestions must be relevant and compelling.  The search engine must provide a helpful list of suggestions after only a few characters have been typed.  This might seem like more Google magic, but in reality, the process is quite simple.  By conducting a few small experiments, I have been able to reverse engineer how the process works on the Google Search Appliance.
The GSA’s Query Suggestions come from a single source – the query logs.  The query logs store the history of all searches run over the last 90 days.  Suggestions are not generated from content in the index, and they are not derived from any external source, like Google.com.  They only come from previous queries submitted against the GSA – or GSAs plural if you have mirroring enabled.  (Note: GSA 7.2 also includes User Added Results in the list of suggestions)
Once a night the GSA analyzes the search logs and creates an optimized database of all the historical queries in the logs (Note: Only queries that returned at least one result are considered, which eliminates noise in the suggestions).  The database is optimized for “begins with” lookups.  When the user starts typing a few letters, it will find previous queries that begin with what the user is starting to type.  The “begins with” matching starts only at the beginning of the queries, not anywhere in the middle.  If I start typing “The Wi” it will suggest “The Wizard of Oz”, but not “Gone With The Wind”.
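To make the "begins with" behavior concrete, here is a minimal sketch of prefix matching over a list of logged queries.  It is not the GSA's implementation (the appliance uses its own optimized database); suggest_prefix is a hypothetical helper, and the case-insensitive comparison is my assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the logged queries that begin with the typed prefix.
# stdin: one historical query per line.  $1: the characters typed so far.
# Case-insensitive matching is an assumption, not documented GSA behavior.
suggest_prefix() {
    awk -v prefix="$1" '
        BEGIN { p = tolower(prefix); n = length(p) }
        # Compare only the first n characters, so matches in the middle are ignored.
        tolower(substr($0, 1, n)) == p { print }
    '
}

# Only the query that starts with "The Wi" is printed.
printf '%s\n' "The Wizard of Oz" "Gone With The Wind" | suggest_prefix "The Wi"
```

A linear scan like this is fine for illustration; a real suggestion database would use an index built for prefix lookups.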
Suggestions are scoped by three variables: Collection, Front End and Access.  You must pass these three values to the “/suggest” service on the GSA, and the GSA will only offer suggestions using historical queries submitted against the same Collection, Front End, and Access type (public vs. secure).  If you use a single GSA to power search on multiple sites, or you use different Front Ends to tailor the search experience for different audiences, scoping prevents queries from one site or audience from bleeding over as suggestions for a different site or audience.  However, no further security or scoping protects the suggestions.  All queries submitted to a certain Collection, Front End, and Access type are eligible to be returned to any other user in the same scope.
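As a rough sketch, a scoped call to the suggest service could look like the following.  The host name is the placeholder used elsewhere in this post.  The site and client parameter names are borrowed from the /search URLs, and the access parameter with the values p and s is my assumption, so check them against Google's documentation for your GSA version.  GSA_CURL is a hypothetical override that allows a dry run with echo.

```shell
#!/bin/sh
# Sketch only: request suggestions for one Collection / Front End / Access scope.
# $1 = typed prefix, $2 = collection, $3 = front end, $4 = access (assumed: p or s)
# Parameter names are assumptions; verify them for your appliance.
suggest() {
    "${GSA_CURL:-curl}" --silent --get "http://mygsa.company.com/suggest" \
        --data-urlencode "q=$1" \
        --data-urlencode "site=$2" \
        --data-urlencode "client=$3" \
        --data-urlencode "access=$4"
}

# Usage (dry run, prints the arguments instead of calling the appliance):
#   GSA_CURL=echo suggest "the wi" my_collection my_frontend p
```

Changing any one of the three scope values should return a different list, since each combination has its own pool of historical queries.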
Because the suggestions come from user queries, and not from the content itself, content ACLs have no effect on the suggestions.  If a user searches for a sensitive term, it is possible for that term to be presented as a suggestion to another user in the same scope.  Blacklists can be used to hide sensitive terms or patterns from the suggestion list, like Social Security numbers or credit card numbers.  The suggestion database can also be exported manually and scanned for sensitive information.
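For the manual scan, something like the sketch below may be a starting point.  It assumes the export has been reduced to one suggestion per line, and the two patterns (a US Social Security number shape and a 16-digit card number shape) are illustrative only.  find_sensitive is a hypothetical helper, and this is not the GSA's blacklist syntax.

```shell
#!/bin/sh
# Hypothetical helper: print exported suggestions that look like an SSN or a
# 16-digit card number.  stdin: one suggestion per line (assumed export format).
find_sensitive() {
    grep -E '(^|[^0-9])[0-9]{3}-[0-9]{2}-[0-9]{4}([^0-9]|$)|(^|[^0-9])([0-9]{4}[ -]?){3}[0-9]{4}([^0-9]|$)'
}

# Usage: find_sensitive < exported_suggestions.txt
```

Patterns this loose will produce false positives and miss other formats, so treat the output as a list to review rather than a verdict.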
Google is a bit vague about how the suggestions are ranked, simply stating that the most popular queries are shown.  A quick test reveals that the suggestions are sorted by the number of occurrences in the logs, with the most frequent on top.  Suppose I submit the following queries against a new Front End:  “blue”, “black”, “brown”, “blue”, “brown”, “blue”.  If I start a new query by typing the letter “b”, I will get the following list of suggestions: “blue”, “brown”, “black”.  The order exactly follows the frequency of each query (3 vs. 2 vs. 1).  This approach is an obvious solution to the problem, and it does a good job of satisfying the relevancy requirement stated above.  Queries that are submitted very often will bubble to the top of the suggestion list, while random or infrequent queries will fall to the bottom or be truncated and not shown.  This is a wisdom-of-the-crowd solution, and the results get better as the audience size and query volume increase.
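The ranking I observed can be reproduced with a few standard tools.  This is a sketch of the behavior, not the GSA's code.  rank_suggestions is a hypothetical helper, and how ties are ordered is my assumption because the test above had none.

```shell
#!/bin/sh
# Hypothetical helper: rank logged queries that start with a prefix by frequency.
# stdin: one logged query per line.  $1: the characters typed so far.
rank_suggestions() {
    awk -v prefix="$1" '
        BEGIN { p = tolower(prefix); n = length(p) }
        tolower(substr($0, 1, n)) == p { print }
    ' |
        sort |                      # group identical queries together
        uniq -c |                   # count occurrences of each query
        sort -k1,1nr |              # most frequent first
        awk '{ sub(/^ *[0-9]+ /, ""); print }'   # drop the count column
}

# Prints blue, brown, black (3 vs. 2 vs. 1 occurrences).
printf '%s\n' blue black brown blue brown blue | rank_suggestions b
```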
Pre-populating Suggestions
Google does not provide a mechanism to pre-load suggestions, but a very simple shell script can be used to do so.  Create a script that runs one search for each desired term or phrase.  This will cause the suggestion database to be pre-populated with those entries.  The suggestion database will quickly build once the site is live, but this technique is useful to avoid a lack of suggestions on day one.

# Quote each URL so the shell does not treat "&" as a background operator.
curl "http://mygsa.company.com/search?q=One&site=my_collection&client=my_frontend"
curl "http://mygsa.company.com/search?q=Two&site=my_collection&client=my_frontend"
curl "http://mygsa.company.com/search?q=Three&site=my_collection&client=my_frontend"
curl "http://mygsa.company.com/search?q=Four&site=my_collection&client=my_frontend"
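
If the terms live in a file, the same idea can be written as a loop.  The sketch below assumes one term or phrase per line and lets curl URL-encode each one, which matters for phrases with spaces.  prepopulate and the GSA_CURL dry-run override are hypothetical names, and the host, collection, and Front End are the placeholders from the lines above.

```shell
#!/bin/sh
# Hypothetical helper: run one search per line of a terms file.
# $1 = path to a file with one term or phrase per line (blank lines are skipped).
prepopulate() {
    while IFS= read -r term; do
        [ -n "$term" ] || continue
        # --get appends the url-encoded values to the URL as a query string.
        "${GSA_CURL:-curl}" --silent --output /dev/null --get \
            "http://mygsa.company.com/search" \
            --data-urlencode "q=$term" \
            --data-urlencode "site=my_collection" \
            --data-urlencode "client=my_frontend"
    done < "$1"
}

# Usage:            prepopulate terms.txt
# Dry run (no HTTP): GSA_CURL=echo prepopulate terms.txt
```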

Going one step further, you can use actual content or text from your site to pre-populate your list of suggestions.  The attached Java class will scan a text file and submit GSA queries for every single, double, or triple word phrase found in the file.  I suggest swiping some text from the site’s homepage or site map and pasting it into a text file as input.  (Note: the double and triple word phrases are not aligned grammatically, so they do not always make sense.  You can exclude them if you don’t like them.)
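For readers who prefer to stay in the shell, the phrase extraction can be approximated as follows.  This is not the Java class mentioned above, only a sketch of the same idea.  phrases is a hypothetical helper, and it treats any non-alphanumeric character as a word break, which is my simplification (so "don’t" splits into two words).

```shell
#!/bin/sh
# Hypothetical helper: emit every 1..N word phrase found in the input text.
# stdin: free text.  $1: longest phrase length in words (defaults to 3).
phrases() {
    tr -cs '[:alnum:]' '\n' |       # one word per line
        awk -v max="${1:-3}" '
            NF { w[++n] = $0 }      # collect words, skipping empty lines
            END {
                for (i = 1; i <= n; i++) {
                    phrase = ""
                    for (k = 0; k < max && i + k <= n; k++) {
                        phrase = (k ? phrase " " : "") w[i + k]
                        print phrase
                    }
                }
            }' |
        sort -u                     # drop duplicate phrases
}

# Usage: phrases 3 < homepage.txt > terms.txt
```

Passing 1 instead of 3 keeps only single words, which is one way to exclude the phrases that do not make sense.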
But I’m Impatient
If you are impatient like me, and don’t want to wait up to 24 hours for the suggestion database to be regenerated, this tip is for you.  In the GSA Admin Console, edit any existing Front End, or create a new Front End, and toggle the checkbox that says:

When you turn this checkbox off and then back on (saving the Front End in between), the entire suggestion database is regenerated immediately (for all scopes).  Depending on the size of your search logs, you could see refreshed suggestions in as little as a couple of seconds.  Much better than 24 hours!  (Note: Google’s JavaScript for query suggestions implements an in-memory cache, so you will need to reload the HTML page to see new suggestions.)
Is this feature the same on the GSA and Google.com?
No – Google.com’s implementation is more sophisticated.  It takes into account additional information like your geographic location and search history.  The Google Search Appliance does not consider personal information when calculating the suggestions.  Each user searching the same Collection and the same Front End at the same access level will see exactly the same suggestions for a given query.
However, by using multiple Front Ends we can begin to simulate Google.com’s personalized suggestions.  If you know, for example, where a user is located, or what type of employee they are, your search application can route them to different and distinct Front Ends.  The Front Ends can be set up identically, but they will act as buckets to collect queries from each segment of the population.  Users will get suggestions from queries run by similar users, as opposed to queries run by the entire population.  It’s a brute-force solution, but for a limited number of permutations, it can be effective.
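The routing itself can be as simple as a lookup from user segment to Front End name.  The sketch below is only an illustration: the segment labels and Front End names are made up, and frontend_for is a hypothetical helper that your search application would call before building the search and suggest URLs.

```shell
#!/bin/sh
# Hypothetical helper: pick a Front End "bucket" for a user segment.
# $1 = region, $2 = employee type (both are made-up segment labels).
frontend_for() {
    case "$1:$2" in
        emea:sales)        echo "fe_emea_sales" ;;
        emea:engineering)  echo "fe_emea_engineering" ;;
        amer:sales)        echo "fe_amer_sales" ;;
        amer:engineering)  echo "fe_amer_engineering" ;;
        *)                 echo "fe_default" ;;   # everyone else shares one bucket
    esac
}

# Usage: client=$(frontend_for emea sales)   # then pass it as the Front End
```

The number of cases grows with every segment dimension you add, which is why this only works for a limited number of permutations.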


Chad Johnson

Chad is a Principal of Search and Knowledge Discovery at Perficient. He was previously the Director of Perficient's national Google for Work practice.
