Attack of the AEM Link Checker!!

Nearly every user of Adobe Experience Manager underestimates the AEM Link Checker. Most people think of it as that annoying feature that incorrectly strips links in AEM. But it can do far more.
Not only will the AEM Link Checker remove links and incorrectly flag links as broken, but it can also bring an AEM instance to its knees.
This isn’t to say that having a tool to check links is a bad idea. A good crawler, like Screaming Frog, is a vital tool in every digital marketer’s toolbox, but why does AEM run its link checker on every request?

AEM Link Checker in the Wild

Recently, we had this happen with an AEM instance. The instance had externalized links in the navigation so that the navigation could be used on multiple sites. As additional pages were brought into AEM, the load the AEM Link Checker inflicted upon the instance increased geometrically. This eventually led to severe performance problems.
Initially, we assumed that the increasing performance problems were due to an errant query or Java Filter. However, one particular thread dump told a very different story.

java.lang.Thread.State: RUNNABLE
at java.util.AbstractCollection.containsAll(AbstractCollection.java:317)
at java.util.AbstractSet.equals(AbstractSet.java:95)
at com.day.cq.rewriter.linkchecker.LinkInfo.isSame(LinkInfo.java:228)
at com.day.cq.rewriter.linkchecker.impl.LinkInfoStorageImpl.putLinkInfo(LinkInfoStorageImpl.java:375)
at com.day.cq.rewriter.linkchecker.impl.LinkCheckerImpl.getLink(LinkCheckerImpl.java:275)
at com.day.cq.rewriter.linkchecker.impl.LinkCheckerTransformer.startElement(LinkCheckerTransformer.java:289)
at org.apache.cocoon.xml.sax.AbstractSAXPipe.startElement(AbstractSAXPipe.java:97)
at com.day.cq.mcm.core.newsletter.NewsletterTransformerFactory$NewsletterTransformer.startElement(NewsletterTransformerFactory.java:132)
at com.day.cq.rewriter.htmlparser.DocumentHandlerToSAXAdapter.onStartElement(DocumentHandlerToSAXAdapter.java:105)
at com.day.cq.rewriter.htmlparser.HtmlParser.processTag(HtmlParser.java:640)
at com.day.cq.rewriter.htmlparser.HtmlParser.update(HtmlParser.java:343)
at com.day.cq.rewriter.htmlparser.HtmlParser.write(HtmlParser.java:196)
at java.io.Writer.write(Writer.java:192)
- locked <_0x00000006aab74560> (a com.day.cq.rewriter.htmlparser.HtmlParser)
at java.io.PrintWriter.write(PrintWriter.java:456)
- locked <_0x00000006aab74560> (a com.day.cq.rewriter.htmlparser.HtmlParser)
at org.apache.sling.scripting.core.impl.helper.OnDemandWriter.write(OnDemandWriter.java:75)
- locked <_0x00000006aab9c3c0> (a org.apache.sling.scripting.core.impl.helper.OnDemandWriter)
at java.io.PrintWriter.write(PrintWriter.java:456)
- locked <_0x00000006aab9c3c0> (a org.apache.sling.scripting.core.impl.helper.OnDemandWriter)
at org.apache.sling.scripting.core.impl.helper.OnDemandWriter.write(OnDemandWriter.java:75)
- locked <_0x00000006aab9c428> (a org.apache.sling.scripting.core.impl.helper.OnDemandWriter)
at java.io.PrintWriter.write(PrintWriter.java:456)
- locked <_0x00000006aab9c428> (a org.apache.sling.scripting.core.impl.helper.OnDemandWriter)
at java.io.PrintWriter.write(PrintWriter.java:456)
- locked <_0x00000006aab9c478> (a java.io.PrintWriter)
at java.io.PrintWriter.write(PrintWriter.java:473)
at org.apache.sling.scripting.sightly.apps.example 

To confirm what the dump suggested, I reviewed the logs and then used grep to count the matching messages in the error log:

grep -wc 'External links for host .* has reached the maximum number of' error.log

Shockingly, this returned over 1,300,000 instances of the log message over the last 24 hours. To determine which domains were causing the issue, I then ran another command to list only the unique messages:

grep 'External links for host .* has reached the maximum number of' error.log | sort --unique

From there, I ran the original grep command again, with specific domains in place of the wildcard, to determine which domains were most responsible.
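
The per-domain counting described above can also be done in one pass. The following is a minimal sketch, not the commands used at the time: the function name `count_linkchecker_hosts` is hypothetical, and it assumes the host appears as a single whitespace-delimited token directly after "External links for host" in each log line.

```shell
# Count link checker "maximum number of" messages per host, busiest first.
# Assumption: the host is the single token that follows
# "External links for host"; adjust the pattern if your log lines differ.
count_linkchecker_hosts() {
  # $1 is the path to the error log to scan.
  grep -o 'External links for host [^ ]*' "$1" \
    | awk '{ print $NF }' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Running `count_linkchecker_hosts error.log | head` would then list the hosts with the most messages first, which avoids re-running grep once per domain.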

Saving AEM from the Link Checker

Ideally, the AEM Link Checker should not be enabled in production instances to ensure that it does not impact performance. If this is not an option due to the potential for other side effects, you can configure the “Link Check Override Patterns” in the “Day CQ Link Checker Service” as described in this Adobe HelpX Article.
For instance, to disable checking of the domain www.example.com, you could use a regular expression like:

^https?:\/\/www\.example\.com
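
Before saving an override pattern, it can help to try it against a few sample URLs. The sketch below is an illustration only: it uses `grep -E` as a stand-in for whatever regular expression engine AEM applies (an assumption), the helper name `matches_override` and the sample URLs are hypothetical, and the slashes are written unescaped because `grep -E` does not need them escaped.

```shell
# Succeeds (exit status 0) when the URL in $2 matches the pattern in $1.
# grep -E stands in for AEM's own pattern matching, which is an assumption.
matches_override() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

pattern='^https?://www\.example\.com'

# Hypothetical sample URLs: two on the overridden host, one on another host.
for url in \
  'http://www.example.com/products.html' \
  'https://www.example.com/' \
  'https://cdn.example.com/logo.png'
do
  if matches_override "$pattern" "$url"; then
    echo "matches override: $url"
  else
    echo "no match:         $url"
  fi
done
```

Only URLs that start with the overridden host match, so links to other hosts would still go through the link checker.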

After configuring the AEM Link Checker to ignore the indicated domains, the AEM instance immediately returned to a stable state.
Hopefully, you’ve already disabled the AEM Link Checker in your production instances. If not, I hope this article helps you identify when it is causing performance problems and resolve them.

About the Author

Dan is a certified Adobe Digital Marketing Technologist, Architect, and Advisor, having led multiple successful digital marketing programs on the Adobe Experience Cloud. He's passionate about solving complex problems and building innovative digital marketing solutions. Dan is a PMC Member of the Apache Sling project, frequent Adobe Beta participant and committer to ACS AEM Commons, allowing a unique insight into the cutting edge of the Adobe Experience Cloud platform.

Thoughts on “Attack of the AEM Link Checker!!”

  1. Hey Dan,
    Great details in the article. I am seeing more or less the same issue. But do you think the performance issue with the link checker remains the same whatever the AEM version is, if we have a large amount of content?
    Thanks,
    Arvind

  2. Hi Dan ,
    I am also seeing similar issues with the linkchecker in our 6.3 Prod environment. Our AEM instance hosts 4 sites, and each of those sites has external links in the page footers. Additionally, our content creators seem to generate external links to our own sites, which should be made relative.
    As of now, my approach is to add the OSGi pattern filter so that the linkchecker skips the externalized links generated by our content creators, which should be made relative. I am also thinking of trying to add markup tags so that AEM skips processing the external links in the footer.
    I am also considering deleting the nodes (associated with the external URLs that are erroring) that are being generated under /var/linkchecker, as these are the cached links that get checked, so that the linkchecker doesn’t iterate through them. Any thoughts on removing these excess nodes that we will no longer check?
    Thanks

  3. It shouldn’t pose an issue, but as with any production system, I would be cautious. If you have multiple instances, I would definitely do it one at a time and even consider taking them temporarily out of rotation so that the link checker for sure won’t be reading / writing to the node structure while you’re doing the deletion.
