I implement distributed persistence solutions for enterprises. For use cases involving highly available, low-latency processing, I typically use DataStax Enterprise. Apache Cassandra provides great persistence characteristics for a real-time, distributed transactional store. For real-time stream processing, I’m a Spark fan. Throw in Solr and graph, add in operational support, and I have all of my favorite open source projects bundled, hardened and optimized. DSE 6 was just released and claims to double performance. I was suspicious, and for good reason: both Spark and Cassandra are already sensibly architected for high performance. I wanted to see what, if any, fundamental changes had been made to justify this claim. Two architectural changes struck me as interesting: moving from SEDA to a thread-per-core (TPC) architecture, and introducing continuous paging. Part one will deal with the migration from SEDA to TPC, while part two will deal with continuous paging.
Cassandra is a distributed log-structured merge tree that provides a sorted map of sorted maps for data persistence. When you write data, the data is appended to a commit log on disk and then written to a memtable in RAM, where it’s sorted and periodically flushed to an SSTable on disk. SSTables are periodically merged and cleaned up during compaction. A read request first looks in the memtable for a fast win. If it needs to find the data on disk, it checks a row cache, then a bloom filter, a partition key cache and finally a partition summary before looking up the compression offset in memory and ultimately getting the data from disk. So while data is stored on disk in append-only logs, this path is so efficient that the bottleneck shifts to the CPU: concurrency performance is directly tied to CPU cores. Once data is written to disk, the bulk of the work is done in memory, so memory-related performance improvements depend on how much hot data you need to keep around. For use-case-agnostic performance increases, there needs to be an architectural change that better leverages the CPU.
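To make that cascade concrete, here’s a minimal sketch of the read path in Java. The types and names (BloomFilter, scanPartitionSummary and so on) are hypothetical stand-ins, not Cassandra’s actual internals; the point is the order of the lookups and how late the disk is actually touched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the read-path cascade described above; the type
// and method names are illustrative, not Cassandra's actual internals.
public class ReadPathSketch {

    interface BloomFilter { boolean mightContain(String key); }
    interface Disk { String readPartition(long offset, String key); }

    private final Map<String, String> memtable = new ConcurrentSkipListMap<>(); // sorted, in RAM
    private final Map<String, String> rowCache = new HashMap<>();
    private final Map<String, Long> partitionKeyCache = new HashMap<>();
    private final BloomFilter bloomFilter;
    private final Map<Long, Long> compressionOffsets; // index position -> offset on disk
    private final Disk disk;

    ReadPathSketch(BloomFilter bloomFilter, Map<Long, Long> compressionOffsets, Disk disk) {
        this.bloomFilter = bloomFilter;
        this.compressionOffsets = compressionOffsets;
        this.disk = disk;
    }

    Optional<String> read(String key) {
        // 1. Memtable: recent writes still in RAM are the fastest win.
        String row = memtable.get(key);
        if (row != null) return Optional.of(row);

        // 2. Row cache: a hot row may already be materialized in memory.
        row = rowCache.get(key);
        if (row != null) return Optional.of(row);

        // 3. Bloom filter: cheaply rule out an SSTable that can't contain the key.
        if (!bloomFilter.mightContain(key)) return Optional.empty();

        // 4. Partition key cache, falling back to a scan of the partition
        //    summary, locates the partition's position in the index.
        Long cached = partitionKeyCache.get(key);
        long indexPosition = (cached != null) ? cached : scanPartitionSummary(key);

        // 5. The in-memory compression offset map translates that position to
        //    a chunk on disk; only at this last step do we touch the disk.
        long diskOffset = compressionOffsets.getOrDefault(indexPosition, 0L);
        return Optional.ofNullable(disk.readPartition(diskOffset, key));
    }

    private long scanPartitionSummary(String key) {
        return 0L; // stand-in for the summary scan
    }
}
```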
SEDA versus TPC is fundamentally another threads-versus-events discussion. Threads are great because they are well-known and understood entities of operating systems, are general-purpose primitives for any kind of parallelism, and are mandatory for exploiting true CPU concurrency. Threads are terrible because the nondeterministic, preemptive scheduling of multi-threaded code leads to a lack of understandability and predictability, rendering a reasonable execution analysis of concurrent code virtually impossible. Events are great because a single-threaded event loop that waits for events and invokes handlers through a deterministic callback mechanism provides a straightforward concurrency model that is easier to reason about. Events are terrible because of the development overhead of large cascading callback chains and the need to manually carry state between callbacks rather than keeping it on the stack. Threads suffer from deadlocks and livelocks, while events suffer from long-running CPU-bound callbacks and blocking, non-yielding callbacks. In other words, programming is hard, the answer is always “it depends” and blogs like this don’t help.
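To ground the event side of that argument, here is a minimal single-threaded event loop in Java. It’s a generic sketch, not tied to any particular framework: handlers run deterministically, one at a time, so there is nothing to lock, but the comment marks exactly where a blocking or long-running callback stalls everything behind it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal single-threaded event loop, as a generic sketch of the event
// model described above (not taken from any particular framework).
public class EventLoopSketch {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    // Handlers are enqueued and later invoked deterministically, one at a
    // time, on the loop thread: no locks, no preemption mid-handler.
    public void submit(Runnable handler) { events.add(handler); }

    public void run() throws InterruptedException {
        while (running) {
            Runnable handler = events.take();
            // The flip side: a long-running or blocking handler stalls
            // every other event behind it, since nothing can preempt it.
            handler.run();
        }
    }

    // Shutdown is itself just an event processed in order.
    public void shutdown() { submit(() -> running = false); }
}
```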
SEDA, the staged event-driven architecture, is a framework for highly concurrent server applications that combines the event-based and thread-based models. It allows developers to break their logic into a series of stages, each with a small, dynamically-sized thread pool that processes incoming events and then passes them to other stages. Dynamic resource throttling allows the system to adapt to overload by controlling the scheduling and resource allocation of these modular components. Breaking down a complex, event-driven process into a series of well-defined, debuggable, reusable modules to develop concurrent applications is a Very Good Thing™. So why change? I feel better letting Matt Welsh describe what he would do differently today. Basically, 1999 was in the before time: Linux threads didn’t scale, multi-core servers were rare and web pages were static. The issue was scalability, not latency. The single-threaded event-loop model fits mostly I/O-bound applications, but out of the box it’s very difficult to take advantage of real CPU concurrency and utilize multiple cores. Garbage collection was always an issue, so it’s not surprising that most implementations suffer from a longer time to safepoint than you want in a low-latency application.
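As a rough illustration (not SEDA’s or Cassandra’s actual code), a stage boils down to an event queue, a small thread pool draining it, and a hand-off to the next stage. A real SEDA controller would additionally resize the pool and shed load based on queue depth; the comment marks where that throttling would hook in.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch of a single SEDA stage: an event queue drained by a
// small thread pool, with processed events handed to the next stage.
public class Stage<I, O> {
    private final BlockingQueue<I> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers;
    private final Function<I, O> handler;
    private final Stage<O, ?> next; // null for the final stage

    Stage(int threads, Function<I, O> handler, Stage<O, ?> next) {
        this.handler = handler;
        this.next = next;
        this.workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(this::drain);
        }
    }

    public void enqueue(I event) { queue.add(event); }

    private void drain() {
        try {
            while (true) {
                // A real SEDA controller would watch this queue's depth here,
                // resizing the pool or shedding load to adapt to overload.
                I event = queue.take();
                O result = handler.apply(event);
                if (next != null) next.enqueue(result); // hand off to next stage
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```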
Matt Welsh is certainly guilty of failing to optimize for things that didn’t exist at the time, but where does that leave us with Cassandra? Benedict Elliott Smith suggested that moving from SEDA to a thread-per-core architecture would solve a wide range of performance problems. About two years later, Aleksey Yeschenko created the CASSANDRA-10989 JIRA meta-ticket to track all of the changes that would be needed for such a major architectural overhaul. Essentially, under a TPC architecture, each CPU core becomes a logical shared-nothing instance of Cassandra, and the thread pinned to that core runs a single event loop tasked with serving requests as well as scheduling and maintenance duties. Since the SEDA architecture was not broken, what was this thread-per-core model supposed to provide? The classic events-versus-threads argument in this case boils down to this: event-driven architectures can’t easily exploit multi-core CPUs (which is critical to performance), while it’s very easy to get multi-threaded code wrong (which is critical to not being wrong).
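Here’s a deliberately simplified sketch of that routing idea in Java. It is not DSE’s implementation: real TPC code pins each thread to a core (which plain Java can’t do directly) and runs a proper event loop rather than an executor, but it shows how hashing a partition key to an owning core eliminates shared state.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the thread-per-core idea (illustrative, not DSE's code): each
// core owns a single-threaded loop, and each request is routed to the core
// that owns its slice of the data, so no state is ever shared.
public class TpcSketch {
    private final ExecutorService[] coreLoops;

    TpcSketch() {
        int cores = Runtime.getRuntime().availableProcessors();
        coreLoops = new ExecutorService[cores];
        for (int i = 0; i < cores; i++) {
            // One single-threaded loop per core; a real implementation would
            // also pin the thread to the core via native affinity calls.
            coreLoops[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void handle(String partitionKey, Runnable request) {
        // Route by key: the owning core processes every request for its
        // slice, so its data structures need no locks and no CAS.
        int core = Math.floorMod(partitionKey.hashCode(), coreLoops.length);
        coreLoops[core].submit(request);
    }
}
```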
The key here is that we are actually talking about a single-thread model; we are agreeing that multi-threaded code is hard. We’re just running a lot of single-threaded instances in an asynchronous distributed network and relying on replication to deliver the (tunably) correct answer to the client. Each instance is responsible for processing just a slice of the data on the node. Once you aren’t trying to synchronize access in a multi-threaded environment, you aren’t doing compare-and-swap operations. No race conditions. More efficient and more debuggable code. And, particular to Cassandra, it’s easier to move non-concurrent data structures off-heap. In Cassandra’s use case, where it makes sense to keep a lot of data around without running into a stop-the-world garbage collection scenario, making it easier and faster to move objects off-heap is worthwhile. In particular, the key cache, row cache, page cache and memtables can be kept off-heap, taking the objects on the read path out of bounds for garbage collection. Speaking of garbage collection, fewer threads means fewer GC roots. And since we have fewer threads, we have fewer context switches and thereby fewer cache misses.
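A toy comparison makes the compare-and-swap point concrete; both counters below are illustrative only, not anything from Cassandra’s codebase.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of the point above: state shared across threads forces
// compare-and-swap; a shared-nothing design per core does not.
public class CounterSketch {

    // Multi-threaded version: every increment is a CAS loop under the hood,
    // contended across cores and bouncing cache lines between them.
    static final AtomicLong shared = new AtomicLong();
    static void incrementShared() { shared.incrementAndGet(); }

    // Thread-per-core version: each loop owns its own counter, so a plain
    // long and a plain add suffice; no CAS, no races, trivially debuggable.
    static final class CoreLocal {
        long count;                       // touched by exactly one thread
        void increment() { count++; }
    }
}
```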
In 1995, John Ousterhout gave the influential talk Why Threads Are a Bad Idea (for Most Purposes), which framed the threads-versus-events debate for years afterward. At the end of the talk he cautioned to “Only use threads where true CPU concurrency is needed”. Agreed.