Home built Hadoop analytics cluster: Part 2
In my previous post, I went through the overall plan I will be following, along with the goals and topics I will be covering. In this post, I cover the initial build-out of the cluster.
[Bill of Materials - BOM]
[Installing and configuring Linux]
Bill of Materials - BOM
|Memory (32 GB)||$109.99||Amazon|
|Storage (500 GB)||$57.99||Amazon|
|Power Supply (600W)||$62.99||Amazon|
|Total||$511.94||Estimated total as of 10/12/2020; does not include shipping or taxes|
Obviously, you can swap out components to fit your needs. I did not want to build a high-end workstation with a GPU, so I opted for a CPU with built-in graphics. I did opt for 32 GB of memory and 500 GB of storage. I could have gone down to 16 GB of memory and 250 GB of storage, but memory and storage are things I always seem to want more of.
I found the hardware assembly to be very straightforward. Everything was compatible, and I didn't have to upgrade the BIOS or do anything fancy. The case worked well with the motherboard. The only thing I wished for was another fan power connector on the motherboard, as it only has one. Overall, I'm pleased with the ease of assembly: it took less than 3 hours to put the machine together and install Linux. I now have a working process to build the secondary nodes of the Hadoop cluster. Time to go back to Amazon and order some more parts! :)
Installing and configuring Linux
I decided to use Ubuntu 18.04.5 LTS (Bionic Beaver). While there are different flavors of Linux, Ubuntu is the distribution that seems to work well in the world of data science, and it is the one I've used most often over the years and am most comfortable with.
I opted for the server installation ISO, as I don't want the overhead of running a GUI on the boxes, and downloaded the ISO from here.
Once I did that, I followed the instructions to make a bootable USB stick. Ubuntu provides instructions for Ubuntu, macOS, and Windows.
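Before writing the ISO to the USB stick, it can be worth checking that the download is intact. The following is a minimal sketch in Python, assuming the published SHA-256 checksum has been copied by hand from the Ubuntu release page; the function names are my own and nothing here is specific to this build.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in fixed-size chunks so a large ISO is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path, expected_hex):
    """Compare the file's digest to a published checksum (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `iso_matches("ubuntu-server.iso", "<checksum from the release page>")` returns True only when the file on disk matches the published value. Both the file name and the checksum string here are placeholders.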
I have a Firewalla Gold firewall on my home network, so I am thinking of creating a separate subnet for the Hadoop cluster. I will most likely pick up a switch and some network cable for these machines to keep them separate from the rest of the home network.
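To make the subnet idea concrete, here is a small sketch of an address plan using Python's standard ipaddress module. The subnet 192.168.50.0/24, the starting host number, and the secondary node names are all hypothetical; only the name hadoop-primary comes from this build.

```python
import ipaddress


def plan_cluster_addresses(cidr, hostnames, first_host_number=10):
    """Assign consecutive addresses inside `cidr` to the given hostnames.

    The first node gets network address + first_host_number, leaving the
    lower addresses free for the router or switch.
    """
    network = ipaddress.ip_network(cidr)
    plan = {}
    for index, name in enumerate(hostnames):
        address = network.network_address + first_host_number + index
        # Reject addresses that fall outside the subnet or on its broadcast address.
        if address not in network or address == network.broadcast_address:
            raise ValueError(f"{cidr} is too small for {len(hostnames)} nodes")
        plan[name] = str(address)
    return plan


# Hypothetical subnet and secondary node names, for illustration only.
plan = plan_cluster_addresses(
    "192.168.50.0/24",
    ["hadoop-primary", "hadoop-secondary-1", "hadoop-secondary-2"],
)
for name, address in plan.items():
    print(f"{address}\t{name}")
```

The printed lines follow the layout of /etc/hosts entries (address, then name), which is one common way to let the nodes find each other by name on a small private network.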
I installed sshd so that I can run the box "headless" (that is, without a monitor, keyboard, or mouse) and log in to the server remotely. I configured the server with the hostname "hadoop-primary". Make sure to update the firewall rules as well, e.g. sudo ufw allow ssh.
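Once sshd is running and the firewall rule is in place, a quick check from another machine can confirm that the SSH port is reachable before the monitor and keyboard are unplugged. This is a minimal sketch using only Python's socket module; the function name is my own, and it assumes the server listens on the default port 22.

```python
import socket


def ssh_banner(host, port=22, timeout=3.0):
    """Return the server's identification line, or None if unreachable.

    An SSH server normally sends a line starting with "SSH-" as soon as a
    client connects. An empty string means the connection was accepted but
    closed without sending anything.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(256)
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None
    return data.decode("ascii", errors="replace").strip()
```

Calling `ssh_banner("hadoop-primary")` from a desktop on the same network should return a string beginning with "SSH-2.0-" if the hostname resolves and the firewall allows the connection. A result of None suggests checking the ufw rule or the sshd service first.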
I held off on upgrading the server distribution, as I want to make sure that an upgrade doesn't conflict with the Hadoop requirements.
I was able to ssh into the box from my Mac desktop. The next step will be to install Ambari and Hadoop, which I'll cover in my next post.