Posts Tagged ‘Apache Spark’


Take advantage of windows in your Spark data science pipeline

Window functions perform calculations across a frame of records around the current record in your Spark data science pipeline. Windows are SQL functions that let you access data before and after the current record to perform calculations. They can be broken down into ranking and analytic functions and, like aggregate functions, Spark provides the […]


Big Data Bootcamp by the Beach: An introduction

This is a little story about nothing ventured, nothing gained. One day, I got a LinkedIn message asking if I would like to teach a Big Data Bootcamp at an event for the Universidad Abierta Para Adultos in Santiago de los Caballeros, República Dominicana. Luis didn’t know me; he just saw my profile and saw that I’ve been […]

Hangzhou Spark Meetup 2016

Last weekend there was a meetup in Hangzhou for the Spark community, and about 100 Spark users and committers attended. It was great to meet so many Spark developers, users and data scientists and to learn about recent Spark community updates, road maps and real use cases. The event organizer delivered the first presentation […]

How to Load Oracle Data into SparkR Dataframe

Spark 1.4 and onward supplies various ways for users to load external data sources such as RDBMS tables, JSON, Parquet, and Hive files into SparkR. When we talk about SparkR, we have to know something about R. The local data frame is a popular concept and data structure in R […]

A Spark Example to MapReduce System Log File

In some respects, the Spark engine is similar to Hadoop because both of them do Map & Reduce over multiple nodes. The key concept in Spark is the RDD (Resilient Distributed Dataset), with which we can operate over arrays, datasets and text files. This example gives you some ideas on how to do map/reduce […]

How to Configure Eclipse for Spark Application in the Cluster

Spark provides several ways for developers and data scientists to load, aggregate and compute data and return a result. Many Java or Scala developers would prefer to write their own application code (aka a driver program) instead of typing commands into the built-in Spark shell or Python interface. Below are some steps for how to quickly configure […]

How to Setup Local Standalone Spark Node

As my previous post showed, Spark as a big data technology is becoming popular and powerful and is used by many organizations and individuals. The Spark project was written in Scala, which is a purely object-oriented and functional language. So, what can a Java developer do if he or she wants to learn about […]

Hangzhou Apache Spark Meetup

Similar to the Hadoop project, the Apache Spark project is a fast-evolving in-memory engine for large-scale data processing. Particularly in recent years, Spark has been widely adopted by many organizations, and its community receives commits from many contributors. Perficient China GDC colleagues attended a recent Spark technology meetup in Hangzhou. During the meetup […]