Recently, we investigated a CPU spike on an Adobe Experience Manager (AEM) publish server. After some research, we found log entries indicating that indexing had caused the spike.
Adobe Experience Manager is more than a content management system or an application that serves content in response to user requests. AEM also includes more powerful functionality, such as Apache Lucene indexing, which enables full-text search across content in the repository.
Behind the scenes, Apache Lucene fetches the documents in the repository and indexes them based on their metadata and text content. The index update thread wakes up every five seconds to look for content updates. To build the indexes, Apache Lucene uses Apache Tika, a content analysis tool, to extract the internal details of each document, such as its metadata and text.
In real-world scenarios, many companies do not rely on AEM's search functionality. They opt for enterprise-wide search implementations such as Adobe Search and Promote or Apache Solr, and in those setups all text parsing is handled by the third-party engine. So the question is: do we still need Apache Tika to parse documents inside AEM? The answer is no. It is not required, and disabling Apache Tika parsing inside AEM can reduce the CPU spike.
So, how do you disable document parsing by Apache Tika inside AEM? You don't even need to disable the Apache Tika bundles. Tika parsers are configured in XML, and in AEM all we need is a small XML configuration under the Oak Lucene index node.
To disable Apache Tika document indexing in AEM, follow these steps:
1. Open CRXDE Lite.
2. Navigate to /oak:index/lucene.
3. Under the lucene node, create an nt:unstructured node named tika.
4. Under the tika node, create a file node (nt:file) named config.xml.
5. Open config.xml and add the entry below:
   <properties>
     <parsers>
       <parser class="org.apache.tika.parser.EmptyParser">
         <mime>application/zip</mime>
         <mime>application/msword</mime>
         <mime>application/vnd.ms-excel</mime>
         <mime>application/pdf</mime>
       </parser>
     </parsers>
   </properties>
6. Repeat steps 3 to 5 for /oak:index/damAssetLucene.
7. Save all changes.
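If the list of excluded types grows, typing the XML by hand gets error-prone. The following is a minimal Python sketch that generates the same config.xml content from a list of MIME types. The function name `build_tika_config` and the constant `SKIPPED_MIME_TYPES` are our own illustrative names, not part of AEM or Tika; the output still has to be pasted into config.xml as described in the steps above.

```python
import xml.etree.ElementTree as ET

# MIME types to exclude from text extraction (the four from the example above).
SKIPPED_MIME_TYPES = [
    "application/zip",
    "application/msword",
    "application/vnd.ms-excel",
    "application/pdf",
]


def build_tika_config(mime_types):
    """Return config.xml content that maps the given MIME types to EmptyParser."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")
    parser = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.EmptyParser"}
    )
    for mime_type in mime_types:
        ET.SubElement(parser, "mime").text = mime_type
    # ET.indent is available from Python 3.9; older versions emit a single line.
    if hasattr(ET, "indent"):
        ET.indent(properties, space="  ")
    return ET.tostring(properties, encoding="unicode")


if __name__ == "__main__":
    print(build_tika_config(SKIPPED_MIME_TYPES))
```

Running the script prints the XML to the console so it can be reviewed before it goes into the repository.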
In the above example, we are disabling text extraction from ZIP, MS Word, MS Excel, and PDF files. During indexing, these files will be skipped for text extraction.
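To see why this saves CPU, it helps to picture the parser lookup as a table keyed by MIME type. The sketch below is a toy model in Python, not Tika's actual implementation: `full_text_parser`, `empty_parser`, `build_registry`, and `extract_text` are hypothetical names chosen for illustration. It only shows the idea that an excluded type is routed to a parser that does no work and returns no text.

```python
from typing import Callable, Dict, Iterable

Parser = Callable[[bytes], str]


def full_text_parser(data: bytes) -> str:
    """Stand-in for a real, CPU-heavy text extractor (hypothetical)."""
    return data.decode("utf-8", errors="ignore")


def empty_parser(data: bytes) -> str:
    """Mirrors the idea behind EmptyParser: accept the document, return no text."""
    return ""


def build_registry(defaults: Dict[str, Parser], skipped: Iterable[str]) -> Dict[str, Parser]:
    """Copy the default parser table and override the skipped MIME types."""
    registry = dict(defaults)
    for mime_type in skipped:
        registry[mime_type] = empty_parser
    return registry


def extract_text(registry: Dict[str, Parser], mime_type: str, data: bytes) -> str:
    """Look up the parser for a MIME type; in this toy model unknown types yield no text."""
    return registry.get(mime_type, empty_parser)(data)


if __name__ == "__main__":
    defaults = {"application/pdf": full_text_parser, "text/html": full_text_parser}
    registry = build_registry(defaults, skipped=["application/pdf"])
    print(repr(extract_text(registry, "application/pdf", b"quarterly report")))
    print(repr(extract_text(registry, "text/html", b"<p>hello</p>")))
```

In this model the PDF contributes no text to the index, while the HTML document is still parsed as before.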
[Image: the configuration described above]
You can find a complete list of content types on the IANA website. Add each type you want to exclude as another mime entry in step five. Following the example above, you can list any MIME types that you feel can safely be ignored for text extraction.
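If you know the file extensions in your DAM but not their MIME types, Python's standard `mimetypes` module can serve as a quick lookup. This is only a rough sketch: the helper names `mime_type_for` and `mime_elements` are ours, and Python's built-in table is not guaranteed to match the type Tika detects for a given file, so treat the result as a starting point and confirm it against the IANA list.

```python
import mimetypes

# A private MimeTypes instance uses Python's built-in extension table.
_TYPES = mimetypes.MimeTypes()


def mime_type_for(filename):
    """Return the MIME type guessed from the file extension, or None if unknown."""
    mime_type, _encoding = _TYPES.guess_type(filename)
    return mime_type


def mime_elements(filenames):
    """Return sorted, de-duplicated mime entries for the given sample file names."""
    found = {mime_type_for(name) for name in filenames}
    found.discard(None)  # drop extensions the table does not know
    return ["<mime>{}</mime>".format(m) for m in sorted(found)]


if __name__ == "__main__":
    for line in mime_elements(["brochure.pdf", "archive.zip", "specs.pdf"]):
        print(line)
```

Each printed line can be added inside the parser element of config.xml.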
Please leave a comment below if you have any questions about indexing or performance related issues.