Like the Hadoop project, Apache Spark is a fast-evolving in-memory engine for large-scale data processing. In recent years Spark has been widely adopted by many organizations, and its community continues to attract many contributors. Perficient China GDC colleagues attended a recent Spark technology meetup in Hangzhou, where the speakers shared challenges, best practices, and technical insights on applying Apache Spark.
The first topic, on Spark architecture, deployment approaches, and key issues, was delivered by an architect from a local company who recently published a book on Apache Spark code analysis. Many slides covered how Spark manages memory, disk, CPU, and network resources, and how to plan the execution strategy at the application, job, stage, and task levels. The scheduling options, including FIFO and Fair, result in different performance characteristics. For example, applying the Fair strategy when loading data from Spark into Cassandra can be much faster.
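As a rough illustration of the scheduling options mentioned above, here is a minimal Scala sketch of switching Spark from the default FIFO scheduler to the Fair scheduler; the application name and pool name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: enable the Fair scheduler instead of the default FIFO.
val conf = new SparkConf()
  .setAppName("FairSchedulingDemo")      // hypothetical application name
  .set("spark.scheduler.mode", "FAIR")   // default is "FIFO"

val sc = new SparkContext(conf)

// Jobs submitted from this thread can be assigned to a named pool
// (pools are defined in an XML file pointed to by spark.scheduler.allocation.file).
sc.setLocalProperty("spark.scheduler.pool", "cassandra-load") // hypothetical pool
```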
The second topic focused on Spark SQL and a use case at Huawei, a company dedicated to telecommunication solutions. Spark 1.3 brings some nice features: the DataFrame API enables developers to write much simpler code; the Catalyst optimizer applies optimizations automatically; and the Data Sources API has been extended to connect to more sources such as Parquet, JDBC, JSON, HDFS, Amazon S3, Redshift, H2, and HBase.
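To give a feel for how much simpler the DataFrame API makes this kind of code, here is a minimal Scala sketch in the style of Spark 1.3; the file path and column names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("DataFrameDemo"))
val sqlContext = new SQLContext(sc)

// Load a JSON data source into a DataFrame (path and schema are hypothetical).
val users = sqlContext.jsonFile("hdfs:///data/users.json")

// Relational-style operations; Catalyst optimizes the resulting plan automatically.
users.filter(users("age") > 30)
     .groupBy("province")
     .count()
     .show()
```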
An interesting demo of Spark SQL on a data cube was presented:
The data set is about 12 billion records, 1.5 TB in size, processed by four 16-core worker nodes. The use case is to query the min/max data traffic for China telecom, grouped by province. The result returns in 3 seconds, which is really amazing!
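The actual schema was not shown, but a query along these lines might look like the following sketch; the table name, column names, and file path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("TrafficCubeDemo"))
val sqlContext = new SQLContext(sc)

// Hypothetical table: traffic records with province and bytes columns.
sqlContext.parquetFile("hdfs:///data/traffic.parquet").registerTempTable("traffic")

// Min/max data traffic grouped by province, as in the demo.
sqlContext.sql("""
  SELECT province, MIN(bytes) AS min_traffic, MAX(bytes) AS max_traffic
  FROM traffic
  GROUP BY province
""").show()
```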
Another Spark use case was shared on graph analysis. As many may know, graph analysis is one of the core data mining techniques. Imagine China Mobile has 1 billion users; if we want to analyze the associations and classifications among all of those users, how much computing resource is needed? Historically, Hadoop and traditional MapReduce were used for graph analysis. But as Spark has evolved and matured, more and more computing platforms have migrated to Spark.
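Spark's GraphX library is a natural fit for this kind of workload. As a minimal sketch of association analysis over a user graph (the edge file and its format are hypothetical, and the speakers' actual approach was not detailed), connected components can group users who are linked to each other:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

val sc = new SparkContext(new SparkConf().setAppName("UserGraphDemo"))

// Hypothetical input: one "userId userId" edge per line, e.g. call records.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/call_edges.txt")

// Connected components labels each user with the smallest user id reachable
// from it, effectively grouping associated users together.
val components = graph.connectedComponents().vertices

// Count the size of each group of associated users.
components.map { case (_, componentId) => (componentId, 1L) }
          .reduceByKey(_ + _)
          .take(10)
          .foreach(println)
```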
This actually brings up an interesting question: will Spark replace Hadoop MapReduce in the future? I personally don't think so. Currently Hadoop is the dominant platform for big data implementations, and more and more vendors are integrating with Hadoop. Spark has its advantages, but both are parts of the same ecosystem.
In the Perficient big data lab, Apache Spark is a key component alongside the other products in the HDP 2.2 platform. We look forward to learning and practicing more with Spark technology.