Posts Tagged ‘Apache Spark’


It’s good that Spark Security is turned off by default

Security in Spark is OFF by default, which means you are fully responsible for security from day one. Spark supports a variety of deployment types, each with its own set of security levels. Not all deployment types are safe in every scenario, and none is secure by default. Take the time to analyze your situation, […]
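As a quick illustration of what "off by default" means in practice, here is a minimal sketch of a driver explicitly enabling a few of Spark's security settings. The app name and the particular options chosen are assumptions for the example; the right set depends entirely on your deployment type.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SecureSparkApp {
  def main(args: Array[String]): Unit = {
    // None of these are on by default; each must be enabled explicitly.
    val conf = new SparkConf()
      .setAppName("secure-app-demo")
      // Require shared-secret authentication between Spark processes.
      .set("spark.authenticate", "true")
      // Encrypt RPC traffic between the driver and executors.
      .set("spark.network.crypto.enabled", "true")
      // Encrypt blocks spilled to local disk during shuffles.
      .set("spark.io.encryption.enabled", "true")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... job logic ...
    spark.stop()
  }
}
```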


Deep Dive into Databricks Tempo for Time Series Analytics

Time-series data has typically been fit imperfectly into whatever database we were using at the time for other tasks. Time series databases (TSDBs) are now coming to market. TSDBs are optimized to store and retrieve associated pairs of times and values. A TSDB's architecture focuses on time-stamped data storage and on the compression, summarization and life-cycle management […]
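To make the "associated pairs of times and values" idea concrete, here is a rough sketch of the kind of time-bucketed summarization a TSDB (or a library like Tempo) optimizes, written against plain Spark SQL. The data path and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, avg, max}

val spark = SparkSession.builder().appName("tsdb-style-summarization").getOrCreate()
import spark.implicits._

// Hypothetical time-stamped sensor readings: (event_time, sensor_id, value).
val readings = spark.read.parquet("/data/sensor_readings")   // path is an assumption

// Roll each sensor up into 5-minute buckets, the sort of summarization a TSDB is built for.
val summarized = readings
  .groupBy($"sensor_id", window($"event_time", "5 minutes"))
  .agg(avg($"value").as("avg_value"), max($"value").as("max_value"))

summarized.show(truncate = false)
```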


Key Components/Calculations for Spark Memory Management

Different organizations will have different needs for cluster memory management, so there is no single set of recommendations for resource allocation. Instead, it can be calculated from the available cluster resources. In this blog post, I will discuss best practices for YARN resource management with the optimum distribution of Memory, Executors, and Cores for […]
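As a taste of the kind of calculation the post walks through, here is a sketch of the common YARN sizing arithmetic. The cluster figures (10 nodes, 16 cores, 64 GB each) and the rules of thumb are assumptions for the example, not fixed recommendations.

```scala
// Assumed cluster: 10 nodes, each with 16 cores and 64 GB of RAM.
val nodes        = 10
val coresPerNode = 16
val memPerNodeGb = 64

// Leave 1 core and 1 GB per node for the OS and Hadoop daemons.
val usableCores = coresPerNode - 1
val usableMemGb = memPerNodeGb - 1

// Rule of thumb: at most 5 cores per executor for good HDFS throughput.
val coresPerExecutor = 5
val executorsPerNode = usableCores / coresPerExecutor        // 3
val totalExecutors   = executorsPerNode * nodes - 1          // 29, reserving 1 for the YARN AM

// Split usable memory across executors, then subtract ~7% for memory overhead.
val rawMemPerExecutorGb = usableMemGb / executorsPerNode      // 21
val executorMemoryGb    = (rawMemPerExecutorGb * 0.93).toInt  // ~19

println(s"--num-executors $totalExecutors --executor-cores $coresPerExecutor --executor-memory ${executorMemoryGb}g")
```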


Take advantage of windows in your Spark data science pipeline

Windows can perform calculations across a certain time frame around the current record in your Spark data science pipeline. Windows are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking and analytic functions, and, as with aggregate functions, Spark provides the […]
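For a quick sense of the ranking, analytic and aggregate window functions the post covers, here is a minimal sketch using Spark's Scala API. The customer/order data and column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, lag, avg}

val spark = SparkSession.builder().appName("window-demo").getOrCreate()
import spark.implicits._

// Hypothetical per-customer orders: (customer_id, order_date, amount).
val orders = Seq(
  ("c1", "2021-01-01", 20.0),
  ("c1", "2021-01-05", 35.0),
  ("c1", "2021-01-09", 10.0),
  ("c2", "2021-01-02", 50.0)
).toDF("customer_id", "order_date", "amount")

val byCustomer = Window.partitionBy($"customer_id").orderBy($"order_date")

orders
  .withColumn("order_rank",  rank().over(byCustomer))             // ranking function
  .withColumn("prev_amount", lag($"amount", 1).over(byCustomer))  // analytic function: previous record
  .withColumn("running_avg", avg($"amount").over(byCustomer))     // aggregate applied over a window
  .show()
```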


Big Data Bootcamp by the Beach: An introduction

This is a little story about nothing ventured, nothing gained. One day, I got a LinkedIn message asking if I would like to teach a Big Data Bootcamp at an event for the Universidad Abierta Para Adultos in Santiago de Caballeros, República Dominicana. Luis didn’t know me; he just saw my profile and noticed that I’ve been […]

Hangzhou Spark Meetup 2016

Last weekend there was a meetup in Hangzhou for the Spark community, and about 100 Spark users and committers attended. It was great to meet so many Spark developers, users and data scientists, and to learn about recent Spark community updates, road maps and real use cases. The event organizer delivered the first presentation […]

How to Load Oracle Data into SparkR Dataframe

From Spark 1.4 onward, Spark supplies various ways for users to load external data sources such as RDBMS, JSON, Parquet, and Hive files into SparkR. When we talk about SparkR, we need to know something about R. The local data frame is a popular concept and data structure in R […]
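The post itself works in SparkR; as a rough Scala-side equivalent of the same JDBC read, here is a sketch. Every connection detail below (host, service name, table, credentials) is a placeholder, and the Oracle JDBC driver jar must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-jdbc-read").getOrCreate()

// All connection details are placeholders; adjust to your Oracle instance.
val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "HR.EMPLOYEES")
  .option("user", "scott")
  .option("password", "tiger")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

employees.printSchema()
```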

A Spark Example to MapReduce System Log File

In some respects, the Spark engine is similar to Hadoop because both of them perform Map & Reduce over multiple nodes. The key concept in Spark is the RDD (Resilient Distributed Dataset), with which we can operate over arrays, datasets and text files. This example gives you some ideas on how to map/reduce […]
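To give a flavor of the map/reduce pattern over a log file, here is a small RDD sketch that counts log lines by level. The file path and log levels are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogLevelCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-level-count"))

    // Path and log format are assumptions; point this at your own system log.
    val logs = sc.textFile("/var/log/syslog-sample.txt")

    // Map each line to (level, 1) pairs, then reduce by key to count occurrences.
    val counts = logs
      .flatMap(line => Seq("ERROR", "WARN", "INFO").filter(lvl => line.contains(lvl)))
      .map(level => (level, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (level, n) => println(s"$level: $n") }
    sc.stop()
  }
}
```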

How to Configure Eclipse for Spark Application in the Cluster

Spark provides several ways for developers and data scientists to load, aggregate and compute data and return a result. Many Java or Scala developers would prefer to write their own application code (aka a driver program) instead of typing commands into the built-in Spark shell or Python interface. Below are some steps to quickly configure […]
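For context, this is the kind of driver program such an Eclipse project would build into a jar and submit to the cluster. The object name, app name and master URL below are placeholders, not part of the original walkthrough.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal driver program; in practice it is packaged as a jar from the IDE
// and pointed at the cluster's standalone master.
object MyFirstDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("eclipse-driver-demo")
      .setMaster("spark://master-host:7077")   // the cluster master URL is a placeholder

    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"Sum of doubled values: $total")
    sc.stop()
  }
}
```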

How to Setup Local Standalone Spark Node

As we may know from my previous post, Spark as a big data technology is becoming popular and powerful and is used by many organizations and individuals. The Spark project was written in Scala, which is an object-oriented and functional language. So, what can a Java developer do if he or she wants to learn about […]

Hangzhou Apache Spark Meetup

Similar to the Hadoop project, the Apache Spark project is a fast-evolving in-memory engine for large-scale data processing. Particularly in recent years, Spark has been widely used in many organizations, and its community receives commits from many contributors. Perficient China GDC colleagues attended a recent Spark technology meetup in Hangzhou; during the meetup […]