Friday, October 9, 2020

Home built Hadoop analytics cluster: Part 1

I am pursuing the Master of Science in Data Science degree through the University of Wisconsin System / UW Extended Campus; my home campus is UW-La Crosse.  For the Fall 2020 semester, I am taking DS730 - Big Data: High-Performance Computing.

I have been learning how to implement algorithms for the distributed processing of large data sets across computing clusters; how to create parallel algorithms that can process large data sets; and how to use tools such as Hadoop, Pig, Hive, and Python to tackle large data-processing tasks on cloud-computing services such as AWS.

One of the things that has struck me so far is how much was already "set up" for me in this course.  While I understand the intent of the course is to focus on the programming skills needed to handle big data within data science, part of me wants to understand how the applications get set up in the first place.  To me, putting together the actual infrastructure and software is part of the learning exercise, in addition to the programming.  That way, when issues occur in the infrastructure or software, I will have some foundational skills for basic troubleshooting.

In class, we were given a Docker container and deployed it to a t3.xlarge instance as a single-node Hadoop cluster.  What I'd like to do is expand beyond that and build out an actual 3-4 node cluster.  I will dedicate the servers I build to Hadoop, rather than build and deploy a container.  One thing the instructors of my class did well was to show how Ambari can be used to manage a Hadoop cluster.  As a result, I will be leveraging Ambari in my home-built Hadoop solution to help deploy and administer the cluster.

Project Summary
So what are the goals of doing this?  The objectives of this project are to:
  • Build a Hadoop cluster from the "ground up"
  • Learn Hadoop Administration
  • Learn how to use Python for MapReduce problems
  • Practice writing Pig commands
  • Start poking around with larger data sets outside of my DS730 class
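To give a flavor of the Python-for-MapReduce goal above, here is a minimal word-count sketch in the Hadoop Streaming style (my own illustrative example, not tied to any specific cluster setup): the mapper emits key/value pairs, the framework sorts them by key, and the reducer sums each run of identical keys.

```python
#!/usr/bin/env python3
# Word count in the MapReduce style used by Hadoop Streaming.
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key,
    which is what the shuffle/sort phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce.
    sample = ["hadoop pig hive", "hadoop python"]
    mapped = sorted(mapper(sample))
    for word, count in reducer(mapped):
        print(f"{word}\t{count}")
```

In a real Hadoop Streaming job, the mapper and reducer would be separate scripts reading from stdin and writing to stdout, with Hadoop handling the sort between them; the `sorted()` call here just stands in for that shuffle phase.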
Overall it is my intent to have blog posts that cover the following:
  • Building out the cluster
  • Installation of the software stack
  • Client installation that covers connectivity to the cluster
  • Explore R and Python options for using the cluster
  • Data import and cleansing
  • Plotting and analytics projects making use of data from the cluster

Building the cluster
I went through the exercise of building a single-node cluster on my local Linux system, as outlined in Data Analytics with Hadoop: An Introduction for Data Scientists.  My initial thought was to build a 4-node cluster to model a realistic cluster layout.  However, due to budgetary constraints, it looks like I will start with 2, maybe 3, nodes.  I set a target price of $400 per node for a system with 4 cores, 32 GB of memory, and 250 GB of SSD storage.

In pricing out builds online, I realized that a $400 target was too low.  My memory, CPU, and storage came to $286.92.  My case, power supply, and motherboard came from Amazon for $216.23.  Total out-the-door price with shipping was $503.15.

In my next post, I will list out the bill of materials (BOM), cover the actual hardware build, and walk through setting up Linux.