In the application server world, clustering is typically implemented to promote redundancy and scalability, with the ultimate goal of high availability: maintaining uptime on customer-facing sites or services. It’s a common assumption that these same concepts apply when it comes to clustering on Adobe Experience Manager 6.x. Right?
Not so fast.
If you read the fine print, Adobe does not recommend clustering AEM publish instances. You see, clustering on AEM introduces a new level of dependencies and complexity due to the reliance on MongoDB secondaries across data centers, and thus the argument is that clustering in AEM can decrease reliability and performance. That direction from Adobe pretty much throws the whole high-availability clustering use case we find in the application server world out the window.
Instead, Adobe recommends stand-alone TarMK farms for failover of publish instances. Farms perform better, scale easily, and, because they are synced with the Author instance, are inherently fault tolerant. But where to cluster then?
There are several use cases for clustering the AEM Author instances, but as you may have read in my other posts, I am not a fan of clustering author instances simply for failover. That can be achieved using TarMK Cold Standbys with far less complexity and fewer moving parts. See: AEM Infrastructure Series: Disaster Recovery Basics.
For me to recommend a clustering deployment, I like to see one of these use cases:
- Exceeding the authoring capacity limits of concurrent editors and contributors (the users accessing the authoring server). And (this is important) where sharding the author instance is not feasible. More about this later.
- Where regional performance of the authoring instance is important. Authoring in AEM tends to be “chatty” and low latency can vastly improve the perceived performance of an author instance.
- Where uptime of the authoring instance is critical, that is, for organizations that cannot survive even a few minutes of authoring downtime (I’ve yet to encounter one once you dig deep).
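The three use cases above can be read as a simple checklist. The following is only a sketch of that reasoning in Python; the `AuthorProfile` fields and the `clustering_reasons` function are hypothetical names made up for illustration, not part of AEM or any Adobe tooling.

```python
from dataclasses import dataclass


@dataclass
class AuthorProfile:
    """Hypothetical description of an authoring environment (illustration only)."""
    exceeds_concurrency_limit: bool   # more concurrent authors than one instance can serve
    sharding_feasible: bool           # could the sites be split across separate author instances?
    needs_regional_performance: bool  # authors in distant regions suffer from latency
    zero_downtime_required: bool      # even a few minutes of authoring downtime is unacceptable


def clustering_reasons(profile: AuthorProfile) -> list[str]:
    """Return the use cases from the list above that apply to this environment."""
    reasons = []
    # Concurrency only counts when sharding the author instance is not an option.
    if profile.exceeds_concurrency_limit and not profile.sharding_feasible:
        reasons.append("concurrency")
    if profile.needs_regional_performance:
        reasons.append("regional performance")
    if profile.zero_downtime_required:
        reasons.append("uptime")
    return reasons


if __name__ == "__main__":
    # Heavy concurrency, but the sites can be sharded: no reason to cluster.
    print(clustering_reasons(AuthorProfile(True, True, False, False)))
```

An empty result means none of the use cases applies, in which case a single author instance with a TarMK Cold Standby is likely the simpler choice.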
In simple terms, sharding is a technique used to split very large databases into smaller, faster, more easily managed parts. An often overlooked option is to manually shard the authoring instances, that is, to physically split the sites into separate AEM authoring instances. These could be:
- Physical sites, e.g. a primary www vs. a support or intranet site.
- Portions of existing sites, e.g. localized sites where live copies are not required.
- Separating global assets from the primary site authoring instance, e.g. a Global Corporate DAM.
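To make the manual split concrete, here is a minimal sketch of a shard map that resolves a content path to the author instance that owns it. The content paths and host names are assumptions invented for this example; in practice the split comes out of your own site structure.

```python
# Hypothetical shard map: content root -> author instance that owns it.
# Both the paths and the host names are made up for illustration.
SHARDS = {
    "/content/www": "author-www.example.com",
    "/content/support": "author-support.example.com",
    "/content/dam/global": "author-dam.example.com",
}


def author_for(path: str) -> str:
    """Return the author host owning `path`, using the longest matching content root."""
    matches = [
        root for root in SHARDS
        if path == root or path.startswith(root + "/")
    ]
    if not matches:
        raise KeyError(f"no author instance is responsible for {path}")
    return SHARDS[max(matches, key=len)]


if __name__ == "__main__":
    print(author_for("/content/www/en/home"))
    print(author_for("/content/dam/global/logos/logo.png"))
```

Matching on whole path segments keeps a path such as `/content/wwwx` from being routed to the `www` shard by accident.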
Keep in mind that with independent author instances, each can have its own TarMK Cold Standby, be regionally located to reduce latency, and have separate maintenance cycles. And you don’t need Mongo DBA resources.
Sharding requires careful forethought and planning: digging deep into the use cases, asking lots of questions, and making decisions. This should be done during your initial AEM standup, and these are the questions and advice your AEM consultants should be providing.
Why performance tips? Because maximizing performance of your authoring instances can reduce or eliminate the need for clustering based on concurrency. Remember that Oak uses off-heap memory (memory outside the JVM heap) for deserialization, which speeds up performance. Thus, where a 5.x system could get along with 8-12 GB of memory, for a 6.x system we recommend 64 GB of RAM with 8 GB of JVM heap. Also:
- Dedicated CPU cores can increase performance. We typically recommend 12 or 16 cores.
- SSD storage for the repository folder.
- If your internal infrastructure team is balking at that much SSD storage, use Oak FileDataStore to move the document store to slower magnetic media while leaving the node store on the faster SSD media. Note: we are still waiting for Adobe’s blessing on using the newer and more efficient Oak FileBlobStore. As of this writing it’s technically available, but not supported.
- Use Sling Offloading to offload high CPU jobs.
All of the above will help you extract more performance out of a single licensed AEM author instance, and thus further reduce the need for clustering due to concurrency.
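As a rough illustration, the hardware baseline above (64 GB of RAM, 8 GB of JVM heap, 12 or more cores, repository on SSD) can be turned into a quick pre-flight check. This is a sketch only; the `AuthorHost` fields and the `sizing_gaps` function are hypothetical, and the thresholds are simply the numbers quoted in this post, not official Adobe limits.

```python
from dataclasses import dataclass

# Baseline taken from the recommendations in this post (not official limits).
MIN_RAM_GB = 64
MIN_HEAP_GB = 8
MIN_CPU_CORES = 12


@dataclass
class AuthorHost:
    """Hypothetical description of an author server (illustration only)."""
    ram_gb: int
    jvm_heap_gb: int
    cpu_cores: int
    repository_on_ssd: bool


def sizing_gaps(host: AuthorHost) -> list[str]:
    """List where a host falls short of the baseline; an empty list means it meets it."""
    gaps = []
    if host.ram_gb < MIN_RAM_GB:
        gaps.append(f"RAM {host.ram_gb} GB is below {MIN_RAM_GB} GB")
    if host.jvm_heap_gb < MIN_HEAP_GB:
        gaps.append(f"JVM heap {host.jvm_heap_gb} GB is below {MIN_HEAP_GB} GB")
    if host.cpu_cores < MIN_CPU_CORES:
        gaps.append(f"{host.cpu_cores} cores is below {MIN_CPU_CORES}")
    if not host.repository_on_ssd:
        gaps.append("repository folder is not on SSD")
    return gaps


if __name__ == "__main__":
    # A 5.x-era server: too little RAM, too few cores, magnetic disks.
    for gap in sizing_gaps(AuthorHost(12, 8, 8, False)):
        print(gap)
```

Note that the memory left over after the heap (56 GB in the baseline) is not wasted: it is what Oak has available outside the JVM.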