Microsoft Blog

SOLR Failure Brings Down Live Sitecore Site

Recently, a client’s live Sitecore site was brought down completely by a failure in the SOLR search provider service. We immediately tracked down the root cause and set to work on resolving it.

I’ll start with the root cause and continue on with the steps that were taken to prevent Sitecore from going down in the event of a SOLR failure.

Background: The Sitecore instance was 8.1 update 2 (rev 160302), and the SOLR version was 5.2.0.

Over the course of the day, the SOLR JVM consumed all of the memory allocated to it. At that point the SOLR service stopped responding to requests: the server hosting SOLR was still online, but the SOLR service itself was unresponsive and throwing heap memory errors. We had thought that the JVM had been configured to use more memory, but we were wrong; the default value of 512MB was still in place. To correct this, we updated the SOLR configuration file to set both the minimum and maximum heap size to 1024MB.
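A quick way to confirm what heap the JVM is actually running with, rather than what we assume it has, is to ask SOLR itself. The following is a minimal Python sketch and not part of the original fix. It assumes the system info endpoint (`/solr/admin/info/system?wt=json`) returns a `jvm.memory.raw` object with `used` and `max` values in bytes; the URL and the 90% threshold are hypothetical placeholders.

```python
import json
from urllib.request import urlopen

# Hypothetical URL; adjust the host and port for your SOLR server.
SYSTEM_INFO_URL = "http://localhost:8983/solr/admin/info/system?wt=json"


def heap_usage(system_info):
    """Return (used_mb, max_mb, fraction_used) from a parsed system info response.

    Assumes the response carries jvm.memory.raw.used and jvm.memory.raw.max in bytes.
    """
    raw = system_info["jvm"]["memory"]["raw"]
    used_mb = raw["used"] / (1024 * 1024)
    max_mb = raw["max"] / (1024 * 1024)
    return used_mb, max_mb, raw["used"] / raw["max"]


def heap_is_under_pressure(system_info, threshold=0.9):
    """True when the used heap is at or above the given fraction of the maximum."""
    return heap_usage(system_info)[2] >= threshold


def fetch_system_info(url=SYSTEM_INFO_URL, timeout=5):
    """Fetch and parse the system info JSON from a running SOLR server."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `heap_usage(fetch_system_info())` against a running server would show whether the maximum is still near the 512MB default, assuming the response has the shape described above.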

While researching the problem, we also saw warnings in the SOLR logs that looked like the following:

[indexname] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Looking into this warning, we found that it was common up through our version of SOLR and listed as resolved in SOLR 5.2.1 and newer. As part of our effort to prevent future issues, we upgraded SOLR to version 5.5.3.
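To gauge how often the onDeckSearchers warning shows up before and after a change like this, a small script can tally it per index. This is a minimal Python sketch that assumes the log lines contain the text shown above, possibly with a timestamp or level in front; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: [indexname] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
WARNING_PATTERN = re.compile(
    r"\[(?P<index>[^\]]+)\]\s+PERFORMANCE WARNING: Overlapping onDeckSearchers=(?P<count>\d+)"
)


def count_overlap_warnings(lines):
    """Count overlapping-searcher warnings per index name."""
    counts = Counter()
    for line in lines:
        match = WARNING_PATTERN.search(line)
        if match:
            counts[match.group("index")] += 1
    return counts


def count_overlap_warnings_in_file(path):
    """Tally the warnings in a log file, tolerating odd bytes in the log."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return count_overlap_warnings(log_file)
```

A count that drops to zero after the upgrade would match what we saw in our own logs.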

Since the increase in the SOLR JVM’s minimum/maximum heap setting and the upgrade to SOLR 5.5.3, we have not seen the previous errors or any additional errors in our logs.

On to the Sitecore side of things.

Why did the site go down? What can we do to prevent this?

This was a known issue, tracked under reference number 391039 and listed as resolved in Sitecore 8.0 update 1. I reached out to Sitecore support about it, and they pointed me to a patch hosted on the Sitecore Support GitHub page. The patch adds a monitor that, by default, checks the connection to the SOLR cores every 10 seconds. The check interval is configurable through the Interval setting in the Sitecore.Support.391039.config file included in the patch.
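The idea of a periodic connection check can be sketched in a few lines. The Python below is only an illustration of the concept, not the patch's implementation (the patch is a .NET assembly). It assumes each core exposes SOLR's ping handler at `/solr/<core>/admin/ping`; the base URL and core names are hypothetical.

```python
import http.client
import time
from urllib.request import urlopen

BASE_URL = "http://localhost:8983/solr"  # hypothetical
CORES = ["sitecore_web_index", "sitecore_master_index"]  # hypothetical core names
INTERVAL_SECONDS = 10  # same as the patch's default interval


def ping_core(core, base_url=BASE_URL, timeout=5):
    """Return True if the core's ping handler answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/{core}/admin/ping?wt=json", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


def check_cores(cores, previous, ping=ping_core):
    """Ping each core once; return (current_state, list of (core, event))."""
    current = {}
    events = []
    for core in cores:
        up = ping(core)
        current[core] = up
        was_up = previous.get(core)
        if was_up is None:
            events.append((core, "established" if up else "failed"))
        elif was_up and not up:
            events.append((core, "failed"))
        elif not was_up and up:
            events.append((core, "restored"))
    return current, events


def monitor(cores=CORES, interval=INTERVAL_SECONDS):
    """Loop forever, reporting only when a core's state changes."""
    state = {}
    while True:
        state, events = check_cores(cores, state)
        for core, event in events:
            print(f"{core}: connection {event}")
        time.sleep(interval)
```

Reporting only on state changes keeps the output short: one line when a core is first reached, one when it drops, and one when it comes back.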

With the patch in place, the site stays online when SOLR is unavailable. Some functionality is impacted, but it is not a complete outage. For example, on a CM server where FXM is in use, the site list will fail to populate while SOLR is down. The patch also writes to the Sitecore log files and prefixes each message with “SUPPORT: Connection to [<core name>] Solr core . . .”. The log messages indicate whether the connection was established, failed, or was restored, which helps determine whether SOLR is unavailable and needs to be investigated.

In addition to the Sitecore.Support.391039.dll and the Sitecore.Support.391039.config file provided in the zip file on the GitHub link, steps 3 and 4 of the ‘Deployment’ section outline the need to update additional values in the SOLR configs.

I opted to handle the updates mentioned in step 3 (step 4 isn’t in place in our instance) with a separate patch file. A patch of a patch, if you will.

Since CM and CD servers access different indexes, I created a CM version and a CD version with the relevant patching.

CM patch:

CD patch:

With the JVM memory increase, the SOLR upgrade, and the Sitecore support patch, we can now avoid those 1 a.m. calls telling us the site has gone down!

Please feel free to reach out to me here or on Twitter @bill_cacy with any comments, suggestions or ideas on topics you would like to see covered.

Perficient Microsoft Blog
