LinkedIn open sources a lot of code. Kafka, of course, but also Samza and Voldemort and a bunch of Hadoop tools like DataFu and Gobblin. Open-source projects tend to be created by developers to solve engineering problems while commercial products … Anyway, LinkedIn has a new open-source data offering called OpenHouse, which is billed as […]
Posts Tagged ‘Hadoop’
Hadoop Ecosystem Components
The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. The four major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a framework that enables the processing of large data sets which […]
Take advantage of windows in your Spark data science pipeline
Windows can perform calculations across a certain time frame around the current record in your Spark data science pipeline. Windows are SQL functions that allow you to access data before and after the current record to perform calculations. They can be broken down into ranking, analytic, and aggregate functions. Spark provides the […]
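To make the idea of a window concrete, here is a minimal sketch in plain Python. It is not Spark's API: the function name `windowed`, the sample `readings`, and the column names are all hypothetical and exist only for illustration. It partitions rows by a key, orders each partition, and then computes, for every row, a row number (a ranking-style result) and an average over a frame of one row before and one row after the current record.

```python
from collections import defaultdict


def windowed(rows, partition_by, order_by, value, before=1, after=1):
    """Illustrative window calculation (plain Python, not Spark).

    For each row, adds:
      - row_number: 1-based position of the row within its ordered partition
      - moving_avg: mean of `value` over a frame spanning `before` rows
        preceding and `after` rows following the current row
    """
    # Group rows into partitions, keeping first-seen partition order.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)

    out = []
    for part in partitions.values():
        ordered = sorted(part, key=lambda r: r[order_by])
        for i, row in enumerate(ordered):
            # The frame is clipped at the partition boundaries.
            frame = ordered[max(0, i - before): i + after + 1]
            avg = sum(r[value] for r in frame) / len(frame)
            out.append({**row, "row_number": i + 1, "moving_avg": avg})
    return out


# Hypothetical sample data: temperature readings per sensor per day.
readings = [
    {"sensor": "a", "day": 1, "temp": 10.0},
    {"sensor": "a", "day": 2, "temp": 20.0},
    {"sensor": "a", "day": 3, "temp": 30.0},
    {"sensor": "b", "day": 1, "temp": 5.0},
]

result = windowed(readings, partition_by="sensor", order_by="day", value="temp")
for row in result:
    print(row)
```

Unlike a plain aggregate, which collapses a group into one row, the window keeps every input row and attaches the computed values to it.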
OLAP and Hadoop: The 4 Differences You Should Know
OLAP and Hadoop are not the same. OLAP is a technology for performing multi-dimensional analytics like reporting and data mining. It has been around since 1970. Hadoop is a technology for performing massive computation on large data and has been around since 2002. They can be used together but there are differences when choosing between using Hadoop/MapReduce data […]
Big Data Bootcamp by the Beach: An introduction
This is a little story about nothing ventured, nothing gained. One day, I got a LinkedIn message asking if I would like to teach a Big Data Bootcamp at an event for the Universidad Abierta Para Adultos in Santiago de los Caballeros, República Dominicana. Luis didn’t know me; he just saw my profile and saw that I’ve been […]
How to Load Log Data into HDFS using Flume
Data acquisition is a very important part of building a big data ecosystem. It lets you extract various types of data, such as files, databases, streams, and web pages. If you are just setting up a local environment rather than a real business scenario, you can resolve data acquisition by making use […]
2 Choices for Big Data Analysis on AWS: Amazon EMR or Hadoop on EC2
What are the key differentiators when choosing a Hadoop distribution for Big Data analysis on AWS? We have two choices: Amazon EMR or a third-party Hadoop distribution (e.g., core Apache Hadoop, Cloudera, MapR). Yes, cost is important. But aside from cost, other things to look for include ease of operation, control, manageability, performance, and features. 1. Cost […]
The Year in Review | Top 10 EIS Posts of 2015
It’s been a busy year in the Enterprise Information Systems space. With over 75 posts this year, our in-house experts found themselves face to face with big changes and an abundance of great information to share. We sifted through that content and present to you the Top 10 EIS posts of 2015. Ten | […]
Time Well Spent in 2015
The end of 2015 is fast approaching, with December looming just a week away. For most people, December is packed with the hustle and bustle of last-minute gift shopping, or end-of-year projections and budgets for 2016. Often in the sway of all this activity, many are so focused on the approaching New Year that they […]
Dorothy in the Land of Big Data
Big Data is one of the enabling technologies for companies to digitally transform their operations, their customer interactions, or both. However, the open source world can be complicated, especially in the red-hot Big Data arena. There are a myriad of technologies; some compete with one another, others overlap, some are complementary, and worst of all, […]
Hadoop, Spark, Cassandra, Oh My!
Previously, I reviewed why Spark will not by itself replace Hadoop, but Spark combined with other data storage and resource management technologies creates other options for managing Big Data. Today we will investigate how an enterprise should proceed in this new, “Hadoop is not the only option” world. Hadoop, Spark, Cassandra, Oh My! Open source Hadoop and […]
IBM’s Spark Investment is Evidence Big Data is Dead
Right after I posted my blog on Spark and Hadoop, I came across this article. IBM made a big announcement that they are putting their weight behind Spark. They are committing more than 3,500 developers and programmers to help move Spark forward. This combined with significant support from the Big 3 Hadoop distributors (Hortonworks, Cloudera, […]