Last week I wrote about three advantages of cloud-based search engines. While the future looks bright for these tools, there are still potential downsides. This week I would like to discuss three disadvantages of cloud-based search engines, as well as potential mitigations. I am not saying that these issues should rule out cloud-based search engines. I am simply pointing out some reasonable concerns in the hope that you can make the most informed decision for your company.
1. Lack of Control
Not having to worry about the physical architecture of a search engine means that certain aspects of the implementation will be out of your control. If a cloud-based search engine is not performing fast enough, you cannot simply add another server instance or increase the RAM. If your users are geographically distributed, you cannot simply place a replica server in another location. Cloud services should perform well and they should offer geographic dispersion, but if they don’t, you are unlikely to be able to do anything about it yourself. You are at the mercy of the cloud for those types of configuration changes or improvements.
Mitigation: Some cloud-based search engine vendors do provide visibility into the underlying “hardware” and can adjust certain characteristics upon request, such as size, scale, or location. Ensure that your vendor’s performance promises meet your needs, or that they have a plan to resolve issues that arise in the future.
Another loss of control is around patches, upgrades, and outages. If you are running your own software on-premise, you can control if and when you install updates or patches, and you can thoroughly test them before promoting them to production. Cloud-based services sometimes offer a vetting and approval process for new revisions and features, but not all of them do. Salesforce.com, for example, is very good about sandboxing new features and letting customers choose exactly when to enable them in production. Google takes a different approach, often pushing changes to their cloud applications with advance notice, but no choice as to if or when. In either case, you rarely have the luxury of holding off on updates or new features indefinitely like you can with on-premise software. The cloud advances quickly, and so will your search engine, whether you want it to or not.
Mitigation: Ask your search vendor how new features, patches, and updates are announced, staged and deployed. Ensure that their process meets your own needs for testing and change management.
2. Data Privacy
If you are indexing sensitive or private content, the security of that content will be of the utmost importance to you. While search engines typically index documents and then immediately discard the original content, some artifacts, such as thumbnails, high-fidelity previews, or translations, stick around for the life of the index. Even the act of getting the documents up to the cloud is a potential point of vulnerability.
Mitigation: Most cloud search engines encrypt all content in transit and at rest. Confirm with your vendor that all artifacts of your content are encrypted and double-check who has access to the private keys. Analyze the content-indexing workflow and ensure that there are no vulnerable spots on the journey from ground to cloud.
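To make that concrete, here is a minimal sketch of one pattern worth asking your vendor about: keeping the original bytes under a key you control while sending only the extracted text for indexing. The endpoint URL and payload fields below are hypothetical; real vendor APIs and “bring your own key” options vary.

```python
# Sketch: keep the original document under your own key; send only the
# searchable text in the clear (over TLS). Endpoint and fields are hypothetical.
from cryptography.fernet import Fernet  # pip install cryptography
import requests

key = Fernet.generate_key()   # store this in your own key management system, not the vendor's
cipher = Fernet(key)

with open("contract.txt", "rb") as f:
    original = f.read()

payload = {
    "id": "contract.txt",
    "text": original.decode("utf-8"),           # searchable text, protected in transit by HTTPS
    "blob": cipher.encrypt(original).decode(),  # stored artifact readable only with your key
}

resp = requests.post("https://search.example.com/v1/index", json=payload, timeout=30)
resp.raise_for_status()
```

With this arrangement, even if the vendor’s at-rest encryption were compromised, the stored original would remain unreadable without your key; the trade-off is that only the plain text stays searchable.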
Even if you are not indexing private content, cloud search engine providers could theoretically have access to sensitive usage information. Every search term submitted by your users could be visible to the search engine provider, even if the queries are made over secure connections. The vendor developed the software, operates the underlying systems, and controls the API endpoint, so it could see the queries being submitted.
Mitigation: If you are considering a “search-as-a-service” product, where all queries are submitted to a common API or service, discuss with the vendor who has access to the API traffic logs and what their policy is regarding access to that information. You might also consider a private cloud implementation, if your vendor offers that option. It will decrease the chance of anyone outside your organization being able to see your content or search traffic, but most likely at a greater cost.
3. Network Bandwidth
As I mentioned in last week’s article, the content sources being indexed by enterprise search engines are steadily moving into the cloud, making cloud-based search engines a natural fit. But not everything is moving. Some repositories may live on-premise for the foreseeable future, and some of these have extremely large amounts of content that need to be indexed. For example, digital video management systems, local network storage, and user PC files are less likely to move to the cloud because of the need for very fast, real-time access. The process of indexing this content with a cloud search engine inevitably requires uploading all of the content to a remote location. There is virtually no way around this. Indexing the content on-premise would defeat many of the advantages discussed in the previous article. Even worse, upload bandwidth is typically more precious than download bandwidth. Indexing terabytes of data in a cloud search engine could put a big strain on a company’s internet bandwidth.
Mitigation: Data compression and traversal efficiencies can help, but only up to a point. Investigate alternative internet network architectures to reduce the impact of large volumes of upload traffic, such as symmetrical networks (which do not favor download bandwidth over upload bandwidth) or separate network segments (to keep upload traffic from slowing down other traffic). Some providers may even accept large archives shipped to the cloud as a one-time physical hard drive delivery.
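As a rough illustration of the compression idea, the sketch below shrinks a document before upload. The indexing endpoint is hypothetical, and whether a vendor accepts gzip-compressed payloads (signaled here with the standard Content-Encoding header) varies, so treat this as a question to ask your vendor, not a guaranteed technique.

```python
# Sketch: compress documents client-side to conserve upload bandwidth.
# The indexing endpoint is hypothetical; vendor support for compressed
# request bodies varies.
import gzip
import requests

with open("export.json", "rb") as f:
    raw = f.read()

compressed = gzip.compress(raw, compresslevel=9)
print(f"Upload shrinks from {len(raw):,} to {len(compressed):,} bytes "
      f"({len(compressed) / len(raw):.0%} of original)")

resp = requests.post(
    "https://search.example.com/v1/index",  # hypothetical endpoint
    data=compressed,
    headers={"Content-Encoding": "gzip"},   # standard header; confirm vendor support
    timeout=60,
)
resp.raise_for_status()
```

Text-heavy formats like JSON and XML often compress to a small fraction of their original size, so even this simple step can meaningfully reduce upstream traffic.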
As I mentioned before, I am not saying that these disadvantages are deal-breakers. They are simply realities of cloud-based enterprise search engines. It will be interesting to see how each vendor deals with these problems as they engineer their products and promote them in the marketplace.
Great post, Chad, especially the advice on mitigating these factors (which can easily be done for some platforms but not all).
Regarding #3 – I find that a search engine’s approach to accessing on-premise content, and how it performs “incremental” indexing, is key. I have also found that many cloud-based search engines don’t have a solution for accessing on-premise content and require the duplication and copying you speak of. Most search engines incrementally update their index but must fully re-crawl the content each time. Platforms like Coveo, however, can both deploy an on-premise connector to avoid duplication and perform truly incremental updates. This allows the connector to query the systems directly (even on-premise) for a list of new/changed items and then access only that subset of data, eliminating the need for duplication and the pain of network bandwidth consumption. A sketch of the general pattern follows below.
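For readers who want to picture it, here is a minimal sketch of that incremental pattern, assuming the repository exposes a change feed. The repo and index objects are hypothetical stand-ins, not Coveo’s or any other vendor’s actual SDK.

```python
# Sketch of truly incremental indexing: record a watermark, then ask the
# source repository only for items changed since the last successful crawl.
# The repo/index APIs are hypothetical stand-ins, not a real vendor SDK.
import json
from datetime import datetime, timezone

STATE_FILE = "last_crawl.json"

def load_watermark() -> str:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["watermark"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"  # first run: crawl everything

def save_watermark(ts: str) -> None:
    with open(STATE_FILE, "w") as f:
        json.dump({"watermark": ts}, f)

def incremental_crawl(repo, index) -> None:
    """Push only the changed subset instead of re-reading the whole repository."""
    since = load_watermark()
    started = datetime.now(timezone.utc).isoformat()
    for doc in repo.changed_since(since):   # hypothetical: repository change feed
        index.upsert(doc)                   # hypothetical: search-engine ingest call
    for doc_id in repo.deleted_since(since):
        index.delete(doc_id)
    save_watermark(started)                 # advance only after a successful pass
```

The key design point is the watermark: because the crawler asks the source for changes rather than re-reading everything, upload traffic scales with the rate of change, not the total size of the repository.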
Again, a very good post series, with the pros/cons of each and, more importantly, how to fully maximize the pros while mitigating the cons!