Searching Sitecore Page Datasources
When first enabling search in your Sitecore solution, you may find that page content is not searchable. This is especially true if you heavily rely on Experience Editor for your site assembly. Searching Sitecore page datasources is solved by a very common pattern that I refer to as “Page Visualization.”
Page Visualization involves creating a custom computed field that crawls the data from every component on a page. You can inspect the page’s presentation details to do this, and it typically works out great.
I recently had an issue on one of my projects, however, where non-essential text such as “Share on Facebook” or “An error occurred” would surface pages in search results. Lots of pages.
The Problem
Before I dig too deep into the solution, I want to share with you an actual example.
Imagine we have a form-style component: “Email this page to a friend.” This component includes some input fields with validators, an error message and a success message. In this example, you can see that we even allowed for content authors to put components within the error panel that would be displayed.
Now, this component was great and met our site’s needs. But then we discovered that we could search through all the text fields on this component. We also found we could search through any sub-components that were nested inside of our panels. For example, if this component were on a page and the user searched for “error,” then any page using this component would come back in search results. It was really annoying.
Why was this happening?
The way we were “visualizing” our pages for search results was essentially looping through all renderings on the page and then crawling all text fields on each rendering. A very common pattern, but a pattern that exposes potential complications with molecular component architecture.
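To make the failure concrete, here is a minimal sketch of that naive loop. It uses plain stand-in types rather than the Sitecore API: `DemoRendering` and `NaiveVisualization` are hypothetical names invented for this illustration, and the "search" is just a case-insensitive substring match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a rendering and the text fields of its datasource.
public sealed class DemoRendering
{
    public string Name { get; set; }

    // field name -> field value
    public Dictionary<string, string> TextFields { get; } = new Dictionary<string, string>();
}

public static class NaiveVisualization
{
    // Concatenate every text field of every rendering, with no filtering at all.
    public static string Build(IEnumerable<DemoRendering> renderings)
    {
        return string.Join(" ", renderings.SelectMany(r => r.TextFields.Values));
    }

    // Case-insensitive substring match, standing in for a real search query.
    public static bool Matches(string visualization, string term)
    {
        return visualization.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

If a page holds an article body plus a form component whose “Error Message” field reads “An error occurred,” then `Build` folds that message into the page text and `Matches(text, "error")` is true, even though the article itself never mentions errors.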
The Solution
To solve this problem, I came up with a mechanism to allow components to “opt out” of being crawled and searched. I also had the requirement to allow molecular components to “opt out” their child components.
The trick? Create a base template that components can inherit from, and check for it at visualization time. It’s so simple, right?
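The check itself boils down to walking the template inheritance chain. A rough sketch of that idea, using a plain dictionary in place of Sitecore templates (the `TemplateGraph` class and the template names in the usage note are made up for illustration, and the graph is assumed to be free of cycles):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TemplateGraph
{
    // baseTemplates maps a template name to the names of its direct base templates.
    // Returns true when 'template' is 'target' or inherits from it at any depth.
    public static bool IsDerivedFrom(
        IReadOnlyDictionary<string, string[]> baseTemplates,
        string template,
        string target)
    {
        if (template == target) return true;

        string[] bases;
        return baseTemplates.TryGetValue(template, out bases)
            && bases.Any(b => IsDerivedFrom(baseTemplates, b, target));
    }
}
```

With “Email Form” inheriting from “Form Base”, and “Form Base” inheriting from “Visualization Exclusion Base”, the check returns true for “Email Form” and false for an unrelated “Article” template.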
Now that I have a template, I create a custom item class to wrap it. I do this for all of my templates because it makes things easier in the long run:
    using Sitecore.Data;
    using Sitecore.Data.Items;

    public class VisualizationExclusionBase : CustomItem
    {
        public static readonly ID TemplateId = ID.Parse("{860CE90F-A13D-43E9-A52C-FC027ED8E822}");

        public VisualizationExclusionBase(Item item) : base(item) { }

        // wraps the item only if it inherits from Visualization Exclusion Base
        public static bool TryParse(Item item, out VisualizationExclusionBase parsedItem)
        {
            parsedItem = item == null || item.IsDerivedFrom(TemplateId) == false
                ? null
                : new VisualizationExclusionBase(item);
            return parsedItem != null;
        }

        #region implicit casting

        public static implicit operator VisualizationExclusionBase(Item innerItem)
        {
            return innerItem != null && innerItem.IsDerivedFrom(TemplateId)
                ? new VisualizationExclusionBase(innerItem)
                : null;
        }

        public static implicit operator Item(VisualizationExclusionBase customItem)
        {
            return customItem != null ? customItem.InnerItem : null;
        }

        #endregion
    }
Now, for the meat and potatoes, we have a custom computed field. In my particular case I’m using Solr, but this technique could apply to other technologies as well. To set up the computed field, I register it in my search index configuration:
    <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
      <sitecore>
        <contentSearch>
          <indexConfigurations>
            <customIndexConfiguration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration">
              <!-- ... -->
              <documentOptions type="Sitecore.ContentSearch.SolrProvider.SolrDocumentBuilderOptions, Sitecore.ContentSearch.SolrProvider">
                <fields hint="raw:AddComputedIndexField">
                  <field fieldName="visualization" returnType="string">Playground.Base.Search.VisualizationField, Playground.Base</field>
                </fields>
              </documentOptions>
              <!-- ... -->
            </customIndexConfiguration>
          </indexConfigurations>
        </contentSearch>
      </sitecore>
    </configuration>
Some people overwrite the default Sitecore _content field, but I didn’t in this case. It should work either way, though, depending on your search logic. The actual logic of the computed field is pretty simple. You’ll also see a ton of extensions to help make the code readable. I love extensions for this reason. 🙂
    using System.Linq;
    using System.Text;
    using Sitecore.ContentSearch;
    using Sitecore.ContentSearch.ComputedFields;
    using Sitecore.Data.Fields;
    using Sitecore.Data.Items;

    public class VisualizationField : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        // iterate through all renderings on the page and grab the content of each
        // that we want to be searchable
        public object ComputeFieldValue(IIndexable indexable)
        {
            var result = new StringBuilder();

            // "as" yields null for non-item indexables, where a direct cast would throw
            var item = indexable as SitecoreIndexableItem;
            if (item == null || item.Item == null)
            {
                return string.Empty;
            }

            // resolve the page's renderings once and reuse them below
            var renderings = item.Item.Visualization.GetRenderings(
                DeviceItem.ResolveDevice(item.Item.Database), false);

            // Don't pull renderings that are nested within things inheriting from
            // Visualization Exclusion Base. Build up a collection of these items
            // based on the presentation of the page. We'll check these later to
            // manage any exclusions that we should respect. Renderings with a missing
            // or unresolvable datasource are skipped rather than dereferenced, and
            // ToList() keeps the query from re-running for every rendering.
            var exclusions = renderings
                .Where(x => x.HasDatasource())
                .Where(x =>
                {
                    var datasource = item.Item.Database.GetItem(x.Settings.DataSource);
                    return datasource != null && datasource.ShouldNotBeVisualized();
                })
                .ToList();

            // Get all renderings on the page
            foreach (var reference in renderings)
            {
                // make sure rendering has a valid datasource
                if (!reference.HasDatasource()) continue;

                // make sure component isn't nested within any non-visualized components
                // we want to kill crawling of any molecular components that opt out
                if (reference.IsNestedWithinAny(exclusions)) continue;

                // pull the datasource and ensure it's valid for the current language
                var source = item.Item.Database.GetItem(reference.Settings.DataSource, item.Item.Language);
                if (source == null) continue;

                // make sure datasource should be visualized
                if (source.ShouldNotBeVisualized()) continue;

                // Go through all fields on datasource
                foreach (Field field in source.Fields)
                {
                    // make sure we're looking at a custom text field, and not a system field
                    // or something more complex like a treelist
                    if (field.ShouldBeIndexed())
                    {
                        result.Append(field.Value.StripHtml()).Append(" ");
                    }
                }
            }

            // don't forget to dig through all fields on the page, to pick up any page level content
            foreach (Field field in item.Item.Fields)
            {
                // again, make sure the field we're looking at should be crawled
                if (field.ShouldBeIndexed())
                {
                    result.Append(field.Value.StripHtml()).Append(" ");
                }
            }

            // result now has all searchable text from the page
            return result.ToString();
        }
    }
When it comes to extensions, you’ve probably found that you have lots of similar ones. I find that I carry my extensions around from project to project. Here are some of mine that I put to good use in this case:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using Sitecore.Data;
    using Sitecore.Data.Fields;
    using Sitecore.Data.Items;
    using Sitecore.Data.Managers;
    using Sitecore.Data.Templates;
    using Sitecore.Layouts;

    public static class Extensions
    {
        // ability to check if an item or template inherits from another template
        // check https://laubplusco.net/sitecore-extensions-does-a-sitecore-item-derive-from-a-template/ for more!
        public static bool IsDerivedFrom(this Item item, ID templateId)
        {
            if (item == null) return false;

            // the template can be missing, e.g. if it was deleted
            var template = TemplateManager.GetTemplate(item);
            return template != null && template.IsDerivedFrom(templateId);
        }

        public static bool IsDerivedFrom(this Template template, ID templateId)
        {
            return template.ID == templateId
                || template.GetBaseTemplates()
                    .Any(baseTemplate => IsDerivedFrom(baseTemplate, templateId));
        }

        // check if rendering has a valid datasource
        public static bool HasDatasource(this RenderingReference reference)
        {
            return reference.RenderingItem != null
                && !string.IsNullOrEmpty(reference.Settings.DataSource);
        }

        // check if an item inherits from visualization exclusion base
        public static bool ShouldNotBeVisualized(this Item item)
        {
            return item.IsDerivedFrom(VisualizationExclusionBase.TemplateId);
        }

        // check if a specific rendering is nested within any of the renderings passed in
        // as the second argument. in this case, I have dynamic placeholders set up and
        // allow for unlimited nesting of components (structurals, wrappers, etc)
        public static bool IsNestedWithinAny(this RenderingReference reference,
            IEnumerable<RenderingReference> renderings)
        {
            if (string.IsNullOrEmpty(reference.Placeholder)) return false;

            foreach (var outerComponent in renderings)
            {
                // placeholders look like /main/left/placeholder
                // they can also be dynamic, like so: /main/left_9b5a4f2c/placeholder_8ae22dc1181
                var outerPlaceholder = outerComponent.Placeholder + "/";
                if (reference.Placeholder.StartsWith(outerPlaceholder, StringComparison.Ordinal))
                    return true;
            }

            return false;
        }

        private static readonly List<string> textFieldTypes = new List<string>(new[]
        {
            "Single-Line Text", "Rich Text", "Multi-Line Text", "text", "rich text",
            "html", "memo", "Word Document", "Raw Text"
        });

        // a field is indexed only if:
        // 1. it can be easily crawled (e.g., is of text type)
        // 2. it is not a system field (e.g., does not start with __)
        // 3. its template does not directly inherit from visualization exclusion base
        public static bool ShouldBeIndexed(this Field field)
        {
            // only allow text types to be indexed
            if (!textFieldTypes.Contains(field.Type)) return false;

            // exclude any sitecore system fields
            if (field.Name.StartsWith("__", StringComparison.Ordinal)) return false;

            // exclude any fields from a Visualization Exclusion Base
            if (field.Definition != null
                && field.Definition.Template.BaseIDs.Contains(VisualizationExclusionBase.TemplateId))
                return false;

            return true;
        }

        // compiled once and reused, instead of being rebuilt on every call
        private static readonly Regex HtmlTagRegex = new Regex("<.*?>", RegexOptions.Compiled);

        // regex utility to strip html from a string
        public static string StripHtml(this string source)
        {
            if (string.IsNullOrEmpty(source)) return string.Empty;

            // remove tags and convert &amp; to &, etc
            var removedTags = HtmlTagRegex.Replace(source, string.Empty);
            return System.Web.HttpUtility.HtmlDecode(removedTags);
        }
    }
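The nesting check is the least obvious of these helpers, so here is a small standalone sketch of the same prefix test working on plain strings instead of rendering references. `PlaceholderNesting` is a hypothetical name used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlaceholderNesting
{
    // True when candidatePlaceholder sits strictly below any of the outer placeholder paths.
    public static bool IsNestedWithinAny(
        string candidatePlaceholder,
        IEnumerable<string> outerPlaceholders)
    {
        if (string.IsNullOrEmpty(candidatePlaceholder)) return false;

        // The trailing "/" keeps "/main/left" from matching "/main/leftover/...".
        return outerPlaceholders.Any(outer =>
            candidatePlaceholder.StartsWith(outer + "/", StringComparison.Ordinal));
    }
}
```

Given an excluded component placed in “/main/left”, a rendering in “/main/left/placeholder” counts as nested, while renderings in “/main/leftover/placeholder” or in “/main/left” itself do not. One thing to keep in mind: the test looks only at placeholder paths, so it cannot tell apart two components that share the same outer placeholder.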
At this point, I simply have to go back to any template that I want to exclude from search results and change it to inherit from Visualization Exclusion Base. We now have a very clean way to control what is and isn’t searchable in our solution.
Hope this helps!