
Hangzhou Spark Meetup 2016

Last weekend there was a meetup in Hangzhou for the Spark community, attended by about 100 Spark users and committers. It was great to meet so many Spark developers, users, and data scientists, and to learn about recent Spark community updates, roadmaps, and real-world use cases.

The event organizer delivered the first presentation, on Spark 2.0. The release date has not been announced yet, but the version promises big improvements in many areas, with several exciting features. The most highlighted one is deeper integration with Tungsten, an execution engine built on ideas from modern compilers and MPP databases and applied to data processing. With whole-stage code generation, the new core engine collapses a query's operators into a single function and eliminates virtual function calls, which reduces CPU time and improves utilization. In Spark 2.0 the Dataset and DataFrame APIs are unified, and users can work with either of them for MapReduce-style programming. Spark Streaming also adopts ideas from other streaming frameworks, giving users the ability to handle dynamic, unbounded data, such as streams arriving at uncertain intervals. There is good news for MLlib as well: it decouples models from the engine by offering the ability to save and load machine learning pipelines and models.
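To make the Tungsten point concrete, here is a minimal sketch in plain Python (not Spark internals; all class and function names are illustrative assumptions) contrasting an interpreted operator chain, which pays a virtual dispatch per operator per row, with the same query fused into one tight loop, as generated code would be:

```python
# Illustrative sketch of the idea behind whole-stage code generation.
# NOT Spark code: a toy query "filter even rows, then double them".

# Interpreted style: each operator is a separate object; every row passes
# through one virtual call per operator.
class Scan:
    def __init__(self, data):
        self.data = data
    def rows(self):
        return iter(self.data)

class Filter:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child
    def rows(self):
        return (r for r in self.child.rows() if self.predicate(r))

class Project:
    def __init__(self, fn, child):
        self.fn, self.child = fn, child
    def rows(self):
        return (self.fn(r) for r in self.child.rows())

def interpreted(data):
    plan = Project(lambda r: r * 2, Filter(lambda r: r % 2 == 0, Scan(data)))
    return list(plan.rows())

# Fused style: the whole pipeline collapsed into a single function,
# one loop, no per-operator dispatch.
def fused(data):
    out = []
    for r in data:
        if r % 2 == 0:      # Filter, inlined
            out.append(r * 2)  # Project, inlined
    return out

assert interpreted(range(10)) == fused(range(10))
```

The fused version computes exactly the same result; the win is that the per-row work becomes a single tight loop the CPU can execute efficiently, which is the effect Tungsten's code generation aims for.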

In the second talk, a data scientist shared some use cases and optimizations with Spark MLlib. Spark already offers a rich set of algorithms and models, such as basic statistics, classification and regression, collaborative filtering, clustering, and text analytics, but each type of ML algorithm suits different business cases. For example, collaborative filtering is mainly used for recommending products to customers, while clustering is a better fit for social networking. The first use case was building a product recommendation system; the common steps are data extraction, matching, and then ranking, and these steps can be applied across different scenarios. The second use case was building a combined offline and online computing framework: the offline part ran on ODPS, YARN, and MLlib, while the online part was built with HBase and Spark Streaming.
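The extract, match, and rank steps described for the recommendation use case can be sketched end-to-end in a few lines. This is a hypothetical plain-Python illustration, not the speaker's MLlib implementation; the data, function names, and the simple overlap-based scoring are all assumptions made for the example:

```python
# Toy extract -> match -> rank pipeline for product recommendation.
# All names and the scoring scheme are illustrative assumptions.

# 1. Extract: pull (user, product, rating) records from raw events.
raw_logs = [
    ("alice", "laptop", 5), ("alice", "mouse", 4),
    ("bob", "laptop", 4), ("bob", "keyboard", 5),
    ("carol", "mouse", 3), ("carol", "keyboard", 4),
]

def extract(logs):
    ratings = {}
    for user, product, score in logs:
        ratings.setdefault(user, {})[product] = score
    return ratings

# 2. Match: candidate products the target user has not rated,
#    drawn from users who share at least one rated product.
def match(ratings, target):
    seen = set(ratings[target])
    candidates = set()
    for user, prods in ratings.items():
        if user != target and seen & set(prods):
            candidates |= set(prods) - seen
    return candidates

# 3. Rank: score each candidate by summing overlapping users' ratings.
def rank(ratings, target, candidates):
    scores = {}
    for user, prods in ratings.items():
        if user == target:
            continue
        for p in candidates & set(prods):
            scores[p] = scores.get(p, 0) + prods[p]
    return sorted(scores, key=scores.get, reverse=True)

ratings = extract(raw_logs)
cands = match(ratings, "alice")
print(rank(ratings, "alice", cands))  # -> ['keyboard']
```

In a real deployment each step would be a distributed Spark job (for example, collaborative filtering via MLlib's ALS), but the three-stage shape stays the same.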

During the discussion time, I was also introduced to many other big data practices with Spark; some teams are still on older versions while others run the latest release, 1.6. It is rare for a complete system to use Spark alone; more often an enterprise will deploy several components such as HDFS, HBase, Impala, Flink, and Storm alongside Spark.
You are welcome to join the Spark, Hadoop, and big data community:

Perficient Big Data Community

Spark

Spark events and meetups


Kent Jiang

I am currently working at the Perficient China GDC in Hangzhou as a Lead Technical Consultant. I have eight years of experience in the IT industry across Java, CRM, and BI technologies. My technical interests include business analytics, project planning, MDM, quality assurance, etc.
