Go grab a stack of papers from the “TO-DO” pile on your desk – it’s OK, I know you have one – mine is full of utility statements to be filed away, drawings from my kids, and coupons for things I will probably never buy. Now find the cable TV bill from April 2016.
Not very difficult, huh? We (humans) are remarkably good at classifying information. We can very quickly tell the difference between a cable TV bill and a water bill. We can quickly decide if something is “What we are looking for” or “Not what we are looking for.” We don’t even break a sweat while finding a single, precise document out of hundreds of random papers. We have learned how to do this intuitively – all the way back to sorting leaves and sticks and acorns into piles or selecting the “best” slice of cake.
Artificial intelligence is very good at similar types of problems – deciding if something is or is not what the program has been trained to recognize, or sorting data into arbitrary piles based on distinguishing characteristics. In machine learning, the first is called classification (a form of supervised learning) and the second is called clustering (a form of unsupervised learning). These techniques are some of the fundamental building blocks of intelligence, and they are being exploited by computer scientists to solve many real-world problems, from recognizing objects in pictures to recommending books you might like based on your reading habits.
I feel confident that, given time and maturation, artificial intelligence built on these techniques will eventually replace traditional document search algorithms. And with the acceleration of progress in the field, this could come sooner than we expect.
First, a little background on what I mean by traditional document search algorithms. Computers have long been able to search and rank documents by the keywords or terms they contain. Statistical algorithms evaluate the frequency of common and uncommon words in the documents in an attempt to find the documents that carry the most significant information about your query terms. Going further, they can look at factors like the position of the terms relative to the structure of the document to assess the importance of different information. They can even watch user behavior to better select documents for a certain query.
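To make the word-frequency idea concrete, here is a minimal sketch of tf–idf scoring in plain Python. It is an illustration only, not how any particular search product works: the function names, the whitespace tokenizer, and the three sample documents are all assumptions made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def tf_idf_scores(documents, query):
    """Return one relevance score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            # Term frequency: share of this document taken up by the term.
            tf = counts[term] / len(tokens)
            # Inverse document frequency: rarer terms get more weight.
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical sample documents, echoing the TO-DO pile.
documents = [
    "cable tv bill for april",
    "water bill for april",
    "coupon for pizza",
]
scores = tf_idf_scores(documents, "cable bill")
best = max(range(len(documents)), key=scores.__getitem__)
print(documents[best])
```

The uncommon word "cable" counts for more than the common word "bill", which is why the cable TV bill outranks the water bill for this query.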
So, go back to the “TO-DO” pile trick. Imagine I asked you to look through tens of millions of documents on your corporate server and find the most recent orientation presentation that was given to new hires. It might take you a very, very long time, and you might hate my guts when you are done, but I would argue that you could do it. Given enough time and patience, search could be done by humans using intelligent classification and unsupervised learning.
Now assume, as predicted by numerous very smart computer scientists and analysts, that artificial intelligence quickly reaches the point of being as good as (or better than) humans at generalized classification, unsupervised learning, and other intelligent techniques. It seems reasonable to predict that computers will start “looking through tens of millions of documents and finding the best / right one” just like that poor intern that now hates my guts. This is still search, I admit, but it’s an entirely different approach – potentially the biggest shift in approach since the invention of algorithms like tf–idf weighting schemes in the 1970s. Instead of focusing on just the words and concepts within a document, this is teaching computers to look at an entire document, compare everything about it to every other document they have seen in the past, and determine whether or not it is a good match for a certain query.
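As a rough illustration of "compare a document to every document seen in the past," here is a toy nearest-neighbor classifier in Python. This is a sketch under heavy assumptions, not the approach any vendor uses: it compares simple word counts with cosine similarity, whereas a real system would compare far richer features of each document. The function names, the "match" / "no match" labels, and the sample texts are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Assumption: a document is represented only by its word counts.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, labeled_examples):
    """Give a new document the label of the most similar document seen so far."""
    vec = vectorize(text)
    best_label, best_sim = None, -1.0
    for example_text, label in labeled_examples:
        sim = cosine(vec, vectorize(example_text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Hypothetical documents the program has already been shown, with labels
# saying whether each was a good match for "new hire orientation".
seen = [
    ("new hire orientation welcome benefits overview agenda", "match"),
    ("quarterly sales forecast revenue pipeline", "no match"),
    ("orientation slides for new employees benefits and policies", "match"),
]

label, similarity = classify(
    "orientation presentation for new hires benefits", seen
)
print(label)
```

The point of the sketch is the shape of the decision: the program never looks up a keyword index, it simply asks which previously seen document the new one most resembles and borrows that document's label.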
At Perficient, we are watching this potential transformation carefully. Few, if any, software vendors are talking to me about such techniques yet. It feels too far off to be part of published roadmaps. But I feel like in as little as a year or two, there will be major search companies touting this as their primary approach to document selection. We will continue to build relationships with as many search companies as possible and be the canary in the coal mine for this emerging technology. I have a feeling that it will be very big news when it ultimately arrives.
Image source: Licensed under Creative Commons Attribution 2.5 Denmark, author: Johannes Jansson