The Hadoop Ecosystem
The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. The four major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a framework that enables the processing of large data sets which […]
Posts Tagged ‘Apache Spark’
Top 5 Takeaways from Databricks Data + AI Summit 2022
The Data + AI Summit 2022 brought big announcements for the Databricks Lakehouse Platform, among them several exciting enhancements to Databricks Workflows, the fully managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform and Delta Live Tables. With these new capabilities, Workflows enables data engineers, data scientists, and analysts […]
Analytics with Incorta using 3rd Party Tools
Incorta provides a comprehensive platform for data acquisition, data enrichment, and data visualization. It can be a one-stop shop for all your data needs, but it can also be combined with other BI visualization tools to enhance the experience even further. Incorta provides a Postgres connection for hooking up third-party visualization tools like Power BI, […]
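As a rough illustration of that Postgres-protocol connection, here is a minimal JDBC sketch in Scala; the host, port, database, table, and credentials below are placeholders, not values from the post, and it assumes the PostgreSQL JDBC driver is on the classpath:

```scala
import java.sql.DriverManager

object IncortaSqlDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder host/port/database for Incorta's PostgreSQL-compatible SQL interface
    val url  = "jdbc:postgresql://incorta-host:5436/tenant"
    val conn = DriverManager.getConnection(url, "bi_user", sys.env("INCORTA_PWD"))
    try {
      // Any tool that speaks the Postgres wire protocol (Power BI, Tableau, ...)
      // can issue the same kind of query against Incorta's loaded tables.
      val rs = conn.createStatement().executeQuery(
        "SELECT * FROM sales.orders LIMIT 10") // hypothetical schema.table
      while (rs.next()) println(rs.getString(1))
    } finally conn.close()
  }
}
```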
It’s good that Spark Security is turned off by default
Security in Spark is OFF by default, which means you are fully responsible for securing it from day one. Spark supports a variety of deployment types, each with its own set of security levels. Not all deployment types are safe in every scenario, and none is secure by default. Take the time to analyze your situation […]
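For context, the kind of opt-in the post is pointing at looks roughly like this in Scala. These are real Spark configuration properties, but which ones you actually need depends entirely on your deployment type; treat this as a minimal sketch, not a complete hardening checklist:

```scala
import org.apache.spark.SparkConf

// None of these security controls are enabled by default.
val conf = new SparkConf()
  .setAppName("SecuredApp")
  .set("spark.authenticate", "true")            // authenticate RPC with a shared secret
  .set("spark.network.crypto.enabled", "true")  // encrypt RPC traffic (AES-based)
  .set("spark.io.encryption.enabled", "true")   // encrypt local shuffle/spill files
// SSL for the web UI and history server is configured separately via spark.ssl.* options.
```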
Deep Dive into Databricks Tempo for Time Series Analytics
Time-series data has typically been fit imperfectly into whatever database we were using at the time for other tasks. Now, purpose-built time-series databases (TSDBs) are coming to market. TSDBs are optimized to store and retrieve associated pairs of times and values, and their architecture focuses on time-stamped data storage along with compression, summarization, and life-cycle management […]
Key Components/Calculations for Spark Memory Management
Different organizations have different needs for cluster memory management, so there is no single set of recommendations for resource allocation; instead, allocations can be calculated from the available cluster resources. In this blog post, I will discuss best practices for YARN resource management with the optimal distribution of memory, executors, and cores for […]
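As a worked example of the kind of calculation the post refers to, here is a Scala sketch for a hypothetical node with 16 cores and 64 GB of RAM. The 5-cores-per-executor ceiling and the ~10% off-heap overhead figure are common rules of thumb, not universal constants:

```scala
// Hypothetical node: 16 cores, 64 GB RAM
val coresPerNode = 16
val memPerNodeGb = 64

// Reserve 1 core and 1 GB for the OS and Hadoop/YARN daemons
val usableCores  = coresPerNode - 1                // 15
val usableMemGb  = memPerNodeGb - 1                // 63

// ~5 cores per executor is a common ceiling for good HDFS throughput
val coresPerExec = 5
val execsPerNode = usableCores / coresPerExec      // 3

// Split memory across executors, then leave room for off-heap overhead
// (spark.executor.memoryOverhead defaults to max(384 MB, ~10% of executor memory))
val memPerExecGb = usableMemGb / execsPerNode      // 21
val heapGb       = (memPerExecGb * 0.9).toInt      // ~18-19 for spark.executor.memory
```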
Take advantage of windows in your Spark data science pipeline
Windows can perform calculations across a frame of records around the current record in your Spark data science pipeline. Window functions are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking functions, analytic functions, and aggregate functions. Spark provides the […]
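A minimal Scala sketch of all three flavors, using a made-up sensor dataset (the column names are illustrative, not from the post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("WindowDemo").getOrCreate()
import spark.implicits._

val readings = Seq(
  ("s1", 1, 10.0), ("s1", 2, 12.0), ("s1", 3, 11.0),
  ("s2", 1, 20.0), ("s2", 2, 18.0)
).toDF("sensor", "hour", "value")

val bySensor = Window.partitionBy("sensor").orderBy("hour")

readings
  .withColumn("rank", rank().over(bySensor))          // ranking function
  .withColumn("prev", lag("value", 1).over(bySensor)) // analytic function: previous record
  .withColumn("avg3", avg("value")                    // aggregate over a frame:
    .over(bySensor.rowsBetween(-2, 0)))               // current row plus 2 preceding
  .show()
```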
Big Data Bootcamp by the Beach: An introduction
This is a little story about nothing ventured, nothing gained. One day, I got a LinkedIn message asking if I would like to teach a Big Data Bootcamp at an event for the Universidad Abierta Para Adultos in Santiago de los Caballeros, República Dominicana. Luis didn’t know me; he just saw my profile and noticed that I’ve been […]
Hangzhou Spark Meetup 2016
Last weekend there was a meetup in Hangzhou for the Spark community, and about 100 Spark users and committers attended. It was great to meet so many Spark developers, users, and data scientists, and to learn about recent Spark community updates, road maps, and real use cases. The event organizer delivered the first presentation […]
How to Load Oracle Data into a SparkR DataFrame
Spark 1.4 and onward supplies various ways for users to load external data sources such as RDBMS tables, JSON, Parquet, and Hive files into SparkR. When we talk about SparkR, we have to know something about R: the local data frame is a popular concept and data structure in R […]
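The post's SparkR code is truncated here, so as a sketch, this is what the equivalent JDBC read looks like in Scala using the DataFrameReader API, given an existing SparkSession `spark`. The host, service name, table, and credentials are placeholders, and the Oracle JDBC driver (ojdbc) must be on the classpath:

```scala
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1") // placeholder host/service
  .option("dbtable", "HR.EMPLOYEES")                         // placeholder schema.table
  .option("user", "hr")
  .option("password", sys.env("ORACLE_PWD"))
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

oracleDF.printSchema()
```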
A Spark Example: MapReduce over a System Log File
In some respects, the Spark engine is similar to Hadoop, since both perform map and reduce operations across multiple nodes. The key concept in Spark is the RDD (Resilient Distributed Dataset), which lets us operate over arrays, datasets, and text files. This example gives you some ideas on how to do map/reduce […]
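In that spirit, here is a minimal word-count-style map/reduce over a log file in Scala; the HDFS path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SyslogMapReduce"))

val lines = sc.textFile("hdfs:///logs/syslog") // placeholder path

// Map phase: split each line into tokens and emit (token, 1) pairs;
// Reduce phase: sum the counts per token across all partitions.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(token => (token, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```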
How to Configure Eclipse for Spark Application in the Cluster
Spark provides several ways for developers and data scientists to load, aggregate, and compute data and return a result. Many Java or Scala developers prefer to write their own application code (aka the driver program) instead of typing commands into the built-in Spark shell or Python interface. Below are some steps for how to quickly configure […]
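A skeleton of such a driver program in Scala, for reference; the master URL is a placeholder, and in practice you would usually omit setMaster and supply it via spark-submit instead:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EclipseDriverDemo")
      .setMaster("spark://master-host:7077") // placeholder cluster URL
    val sc = new SparkContext(conf)

    // Trivial job to confirm the driver can reach the cluster
    val total = sc.parallelize(1 to 1000).sum()
    println(s"sum = $total")

    sc.stop()
  }
}
```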