

Big Data Bootcamp by the Beach: Getting Started Smart


In the first post in this series, I talked about giving a Big Data Bootcamp in the Dominican Republic to a large group of very smart students. In this post, I’ll go over the basic tools and techniques that I think are most relevant in the job market. Most of you are already familiar with these tools, but I want to show how they get extended when you are working with a team. My goal is to have everyone become familiar with a particular family of tools, rather than the specific tool they might get in an enterprise. For example, I picked GitHub over Bitbucket. Atlassian has an enterprise-grade ecosystem, but it isn’t free across the board, while GitHub has a marketplace that is sufficient for understanding the basics of the CI/CD cycle.

Think Local


The preferred way to try new things that may not work out is to do them in a disposable container. This has two great advantages: you can’t trash your machine, and once it works you can send the image to someone else and it will behave the same way on their machine.
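As a minimal sketch of what "disposable" means in practice, the helper below builds a `docker run` command with the `--rm` flag, which deletes the container as soon as it exits. The function name `disposable_container` and the `ubuntu:22.04` image are assumptions chosen for illustration; any image you trust will do. By default it only prints the command, so nothing is executed until you pass `dry_run=False`.

```python
import shlex
import subprocess


def disposable_container(image, command=("bash",), dry_run=True):
    """Build (and optionally run) a throwaway interactive container."""
    argv = [
        "docker", "run",
        "--rm",  # remove the container when it exits
        "-it",   # keep stdin open and allocate a terminal
        image,
        *command,
    ]
    if dry_run:
        # Only show the command; nothing is executed.
        print(shlex.join(argv))
    else:
        # Requires a local Docker installation.
        subprocess.run(argv, check=True)
    return argv


# Hypothetical example image; prints: docker run --rm -it ubuntu:22.04 bash
disposable_container("ubuntu:22.04")
```

Anything you install or break inside that shell disappears when you type `exit`, which is exactly the point.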


Code that is not in a repository is a myth. We will be using git and GitHub for source control and the GitHub Flow process in this tutorial. I recommend going through the Git from the CLI training to understand how to work with git and GitHub from the command line.
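To make the shape of one GitHub Flow cycle concrete, here is a rough dry-run sketch that lists the git commands on either side of a pull request. The function name `github_flow_plan`, the branch name, and the commit message are hypothetical, and the sketch assumes the remote is called `origin` and the main branch is `master`. It prints the commands rather than running them.

```python
import shlex


def github_flow_plan(branch, message, base="master", remote="origin"):
    """Return the git commands before and after the pull request."""
    before_pull_request = [
        ["git", "checkout", base],
        ["git", "pull", remote, base],              # start from an up-to-date base
        ["git", "checkout", "-b", branch],          # one branch per change
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", remote, branch],
    ]
    after_merge = [
        ["git", "checkout", base],
        ["git", "pull", remote, base],              # bring the merged work down
        ["git", "branch", "-d", branch],            # -d refuses if the branch is unmerged
    ]
    return before_pull_request, after_merge


# Hypothetical branch name and commit message.
before_pr, after_merge = github_flow_plan("add-hadoop-dockerfile", "Add Hadoop Dockerfile")

for argv in before_pr:
    print(shlex.join(argv))
print("# open a pull request on GitHub, get it reviewed, and merge it")
for argv in after_merge:
    print(shlex.join(argv))
```

The pull request itself happens on GitHub between the two halves, which is why it appears here only as a comment.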

Act Global

Docker and git will keep your code and environment in order locally, but you’re part of a team now. Also, because hosted providers such as Docker Hub and GitHub expose their APIs, they work very well together, making your life a lot easier.

Docker Hub

Most people are used to pulling images from Docker Hub, but are less familiar with the reasons why they might want a repository of their own. As you can probably imagine by now, I want you to create your own Docker Hub repository.


I recommend going through the GitHub On-Demand training for a thorough grounding in GitHub.

You will need to perform the following:

This is the sort of thing that most developers have done on their own. Where you differentiate yourself in the enterprise is in building disciplined team habits. From now on, use GitHub Flow. Even when you don’t have to. Especially when you don’t have to. You fight the way you train, so don’t let bad habits become your comfort zone.

Now let’s link the two. If you store your Dockerfiles in GitHub, you can set up a pipeline that automatically updates Docker Hub when code is merged into master after a successful pull request.

To integrate GitHub with Docker Hub to create a CI/CD pipeline for deploying Dockerfiles, you will need to configure the Docker Service.
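The trigger behind such a pipeline can be sketched as a small decision function. This is an illustration of the logic, not how Docker Hub implements it: it assumes a GitHub push webhook payload, using only its `ref` field and the `added` and `modified` lists on each commit. The function names and the sample payloads are hypothetical.

```python
from pathlib import PurePosixPath


def changed_paths(push_event):
    """Collect every path added or modified by the commits in a push event."""
    paths = set()
    for commit in push_event.get("commits", []):
        paths.update(commit.get("added", []))
        paths.update(commit.get("modified", []))
    return paths


def should_rebuild_image(push_event, branch="master"):
    """Return True when the push landed on `branch` and touched a Dockerfile."""
    if push_event.get("ref") != f"refs/heads/{branch}":
        return False
    return any(
        PurePosixPath(path).name == "Dockerfile"
        for path in changed_paths(push_event)
    )


# Hypothetical payloads, trimmed to the fields used above.
merged_pull_request = {
    "ref": "refs/heads/master",
    "commits": [{"added": [], "modified": ["hadoop/Dockerfile"], "removed": []}],
}
feature_push = {
    "ref": "refs/heads/add-hadoop-dockerfile",
    "commits": [{"added": ["hadoop/Dockerfile"], "modified": [], "removed": []}],
}

print(should_rebuild_image(merged_pull_request))  # True
print(should_rebuild_image(feature_push))         # False
```

Pushes to a feature branch are ignored, so an image is only rebuilt once the pull request has been merged into master.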

And maybe even beyond

You will need a credit or debit card to set up a cloud account. We will only use the free tier for this class. If you do not have a credit or debit card, don’t worry. We will still be creating local clusters.

In this class, we will evaluate multi-cloud deployments from Day One. While most companies recognize the potential value in moving to the cloud, there are still concerns around putting a company’s entire technology portfolio into a single provider. So while there is an additional administrative overhead in managing multiple cloud providers, a sensible separation of concerns can make for a stronger business case. We will use Google Cloud Platform to deploy our Hadoop cluster using Docker in Kubernetes and send processed data to Amazon Web Services to provide data to Lambda.

Amazon Web Services

Optionally, if you are going to take advantage of the cloud-based elements of the training, you should install Google Cloud Build. This will be configured to build Docker images on Google Cloud Platform when Dockerfiles are committed to the master branch. This is optional, but serves to show the importance of Continuous Integration/Continuous Deployment (CI/CD) to the modern enterprise.

Google Cloud Platform

Big Data Bootcamp Next Steps

In the first post, we set the stage for what I hope to provide in this series. In this post, I made you go through a lot of links and do a lot of installation. Starting in the next post, we will actually write some code using these tools. In fact, we’re going to install a local Hadoop instance straight from Apache.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
