Wednesday, November 4, 2020

Home built Hadoop analytics cluster: Part 5

Got the MySQL database installed and configured on my secondary node.  Installed the driver on the primary node.  Set up a few users and a database.  Tested the connections.
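For reference, the database and user setup can be sketched roughly like this. The database name "ambari", the user "ambari", the password, and the file name are all placeholders I picked for illustration, so adjust them to whatever your install calls for.

```shell
# Sketch only: database name, user name, and password are placeholders.
cat > create_ambari_db.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS ambari;
CREATE USER IF NOT EXISTS 'ambari'@'%' IDENTIFIED BY 'ChangeMe123';
GRANT ALL PRIVILEGES ON ambari.* TO 'ambari'@'%';
FLUSH PRIVILEGES;
EOF

# On the secondary (database) node, apply the file as the MySQL root user:
#   sudo mysql < create_ambari_db.sql
# From the primary node, test the connection (replace <db-host>):
#   mysql -h <db-host> -u ambari -p -e 'SELECT 1;' ambari
```

Keeping the statements in a file makes it easy to rerun them if the node ever has to be rebuilt.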

Now hopefully all goes well with the install!

Sunday, November 1, 2020

Home built blob storage server

Hoping to create "blob" storage like Amazon S3, I recently Googled for open source blob storage.  To my pleasure, I discovered minio.  Minio allows me to expose an S3 compatible service locally on my home network.  I can now work with large datasets in an S3-like fashion locally without the overhead of dealing with an Internet connection.

I can also set up Minio to be a gateway to Amazon S3 or even to my local Hadoop cluster.

I am also able to set up the AWS CLI to interact with minio or have the minio client interact with AWS S3.
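As a rough sketch of the AWS CLI side, a small wrapper function can point the standard "aws s3" commands at the local server. The host name "blobstorage", the default minio port 9000, and the bucket name are assumptions on my part, and the minio access and secret keys would need to be entered with "aws configure" first.

```shell
# Assumption: minio listens on its default port 9000 on a host named "blobstorage".
MINIO_ENDPOINT="http://blobstorage:9000"

# Wrapper that sends "aws s3" commands to the local minio endpoint instead of AWS.
s3local() {
    aws --endpoint-url "$MINIO_ENDPOINT" s3 "$@"
}

# Example usage (bucket name "datasets" is hypothetical):
#   s3local mb s3://datasets
#   s3local cp bigfile.csv s3://datasets/
#   s3local ls s3://datasets/
```

The only real difference from talking to AWS is the --endpoint-url option, so existing scripts should need very little change.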

While redundancy can be implemented with minio, I'll save that as a project for later.

I picked up a Raspberry Pi 4 8 GB model from Amazon ($150) and an 8 TB external USB drive from Costco ($120).  One can always step down to a lower model / storage space if needed - just couldn't resist the savings on an 8 TB drive from Costco. :)

I downloaded the Raspbian Lite image and set up regionalization and my hostname.  I attached the USB drive and created a brand new ext3 partition on it, wiping out everything else.  Then I formatted and mounted the drive and made sure it came up on reboots.
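The formatting and mount-on-boot steps can be sketched roughly as below. The device name /dev/sda1 and the "nofail" option are assumptions on my part; /mnt/data is the mount point used later in this post. The small helper function just builds the /etc/fstab line so it can be checked before it is appended.

```shell
# Assumptions: the new partition is /dev/sda1 and the mount point is /mnt/data.
# Double-check the device with "lsblk" first; mkfs destroys existing data.
#   sudo mkfs.ext3 /dev/sda1
#   sudo mkdir -p /mnt/data

# Build an /etc/fstab line from a filesystem UUID and a mount point.
# "nofail" lets the Pi finish booting even if the USB drive is unplugged.
fstab_line() {
    printf 'UUID=%s %s ext3 defaults,nofail 0 2\n' "$1" "$2"
}

# Example usage:
#   uuid=$(sudo blkid -s UUID -o value /dev/sda1)
#   fstab_line "$uuid" /mnt/data | sudo tee -a /etc/fstab
#   sudo mount -a
```

Mounting by UUID rather than by device name keeps the entry valid even if the drive shows up under a different /dev name after a reboot.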

Then I downloaded minio and the minio client (mc) using wget and symlinked both into /usr/bin:

    sudo ln -s /home/pi/minio /usr/bin/minio
    sudo ln -s /home/pi/mc /usr/bin/mc

Then I made a simple shell script to launch minio.

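    #!/bin/sh
    # Credentials and domain for the minio server (use your own secret values)
    export MINIO_ACCESS_KEY=SuperSecretAccessKey
    export MINIO_SECRET_KEY=SuperSecretSecretKey
    export MINIO_DOMAIN=blobstorage
    # Serve the data directory at /mnt/data
    /usr/bin/minio server /mnt/data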

Then I made a file 


Inside the file I put the following:

    Description=Minio Storage Service mnt-data.mount
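A fuller unit file along these lines might look like the sketch below. The script path /home/pi/start-minio.sh, the "pi" user, and the restart policy are my assumptions for illustration; the mnt-data.mount dependency makes systemd wait for the drive mounted at /mnt/data before starting minio.

```shell
# Sketch: write the unit to a local file first so it can be reviewed.
cat > minio.service <<'EOF'
[Unit]
Description=Minio Storage Service
# Wait for the network and for the drive mounted at /mnt/data
Wants=network-online.target
After=network-online.target mnt-data.mount
Requires=mnt-data.mount

[Service]
User=pi
# Hypothetical path to the launch script
ExecStart=/bin/sh /home/pi/start-minio.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Install and enable it:
#   sudo cp minio.service /etc/systemd/system/minio.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable minio
```

Running the script through /bin/sh means it works whether or not the script itself starts with a shebang line.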


I then verified the service worked as intended:

    $ sudo systemctl start minio
    $ sudo systemctl status minio

I opened my browser, logged in, and was able to access the minio service.  I then made an alias for my minio client.

Wednesday, October 28, 2020

Home built Hadoop analytics cluster: Part 4

So yay!  As mentioned for my next goals in my previous post, I finally got the remaining two boxes built out and added into my home network.  I opted to put the Hadoop cluster on its own subnet with a dedicated unmanaged switch only for the cluster (primary and nodes).

I added the agent and metrics to all of the nodes and rebooted the servers.

Then I followed the instructions to set up the cluster, naming it "ds730" after the class that I'm currently taking - DS730: Big Data - High Performance Computing.

I also made sure I had DNS set up correctly by modifying /etc/systemd/resolved.conf, which fixed my name resolution issues.
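A minimal sketch of the kind of resolved.conf settings involved is below. The DNS server address and the search domain are hypothetical values, so substitute whatever your own network uses.

```shell
# Sketch: DNS server address and search domain below are hypothetical.
cat > resolved.conf.sample <<'EOF'
[Resolve]
DNS=192.168.2.1
Domains=home.lan
EOF

# Merge these settings into /etc/systemd/resolved.conf, then restart the resolver:
#   sudo systemctl restart systemd-resolved
#   systemd-resolve --status    # confirm the DNS server in use (Ubuntu 18.04)
```

Every node needs to resolve every other node by name, so the same settings should go on each box.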

Removed firewall rules.

Disabled the built-in systemd time synchronization by doing: sudo timedatectl set-ntp no

Then installed ntp: sudo apt install ntp

Now I need to look at installing some database drivers, but I think I'm going to call it a night.


Sunday, October 18, 2020

Home built Hadoop analytics cluster: Part 3

In my previous post I covered the Bill of Materials (BOM), hardware assembly, and installing Linux (Ubuntu).  In this post I will cover how I installed Ambari.

Installing Ambari
Rather than build from source, I opted to use the distribution from Cloudera (formerly HortonWorks). Ambari 2.7.5 requires a support agreement with Cloudera, so I dropped down to 2.7.3, which doesn't.

Install some pre-requisites

    sudo apt install python-dev
    sudo apt install gcc

Add Cloudera as a distribution

    sudo wget -O /etc/apt/sources.list.d/ambari.list
    sudo apt-key adv --recv-keys --keyserver B9733A7A07513CAD
    sudo apt-get update

Verify packages show up

    apt-cache showpkg ambari-server
    apt-cache showpkg ambari-agent
    apt-cache showpkg ambari-metrics-assembly

Install and setup the server on primary

    sudo apt-get install ambari-server

Install and setup agent / metrics on all

    sudo apt-get install ambari-agent
    sudo apt-get install ambari-metrics-assembly

Cloudera also had some instructions that I followed on how to configure the Ambari server once installed.
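As a rough sketch of that configuration step: the server gets an interactive setup run, and each agent needs to know the server's hostname. The sample file below only mimics the [server] section of /etc/ambari-agent/conf/ambari-agent.ini, and "hadoop-primary" is the name I gave the primary box in Part 2; treat both as assumptions if your setup differs.

```shell
# On the primary node: run the interactive setup, then start the server.
#   sudo ambari-server setup
#   sudo ambari-server start

# On every node, the agent needs the server's hostname in the [server] section of
# /etc/ambari-agent/conf/ambari-agent.ini. Demonstrated here on a sample copy:
cat > ambari-agent.ini.sample <<'EOF'
[server]
hostname=localhost
url_port=8440
secured_url_port=8441
EOF

# Point the agent at the primary node (hostname "hadoop-primary" is an assumption).
sed 's/^hostname=.*/hostname=hadoop-primary/' ambari-agent.ini.sample > ambari-agent.ini.new
cat ambari-agent.ini.new

# After applying the same change to the real file:
#   sudo ambari-agent start
```

Trying the sed expression on a copy first makes it easy to see the result before touching the real config.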

Next up will be building the remaining boxes of the cluster and installing the agent on those.


So, for some reason some recent updates to Ubuntu decided to cause the resolution of my monitor to go down.

So I tried fixing it by doing some updates, and lo and behold, nothing worked.

Then on top of that, my virtual machine that I was using for my DS730 class decided to go belly up and didn't come back after 2 reboots.

Not having a good day.

So, now I'm going to document the installation of my workstation using a blog post so I have some documentation to fall back onto when things go haywire again.  It looks like I'm going to reinstall Ubuntu again.

Then I need to hope that rebooting the virtual machine one last time will work.

Tuesday, October 13, 2020

Home built Hadoop analytics cluster: Part 2

In my previous post, I went through my overall plan that I will be following, along with goals and topics that I will be covering.  In this post, I will cover the initial building out of the cluster.

[Bill of Materials - BOM]
[Hardware assembly]
[Installing and configuring Linux]

Bill of Materials - BOM

Memory (32 GB)
Storage (500 GB)
Power Supply (600W)
Total: $511.94*
* Total estimated price as of 10/12/2020. Does not include shipping/taxes.

Obviously, you can swap out components to fit your needs.  I did not want to make a high end workstation with a GPU, opting to use a CPU that had graphics built in.  I did opt to get 32 GB memory and 500 GB storage - I could have gone down to 16 GB for memory and 250 GB for storage, but I feel that memory and storage are things that I always seem to want more of.

Hardware assembly

I found the hardware assembly to be very straightforward.  Everything was compatible and I didn't have to upgrade the BIOS or do anything fancy.  The case worked well with the motherboard.  The only thing I wished for was another fan power header on the motherboard, as it only has one.  But overall, I'm pleased with the ease of assembly - it took less than 3 hours to put together and install Linux.  I now have a working process to build the secondary nodes of the Hadoop cluster.  Time to go back to Amazon and order some more parts! :)

Installing and configuring Linux

I made the decision to use Ubuntu 18.04.5 (Bionic Beaver) LTS.  While there are different flavors of Linux, Ubuntu is the distribution that seems to work well in the world of data science, and one that I've used often over the years and am most comfortable with.

I opted for the server installation ISO, as I don't want the overhead of running a GUI on the boxes and downloaded the ISO from here.

Once I did that, I followed the instructions to make a bootable USB stick.  They have instructions for Ubuntu, macOS, and Windows.

I have a Firewalla Gold firewall on my home network, so I am thinking of creating a separate subnet for the Hadoop cluster.  I will most likely pick up a switch and some network cable for these machines to keep them separate from the rest of the home network.

I did install sshd so that I can run the box "headless", that is, without a monitor, keyboard and mouse, and do a remote login into the server.  I configured the server with the name of "hadoop-primary".  Make sure to update the firewall rules as well, e.g. sudo ufw allow ssh.

I held off on doing any upgrades to the server distribution, as I want to make sure that I don't upgrade and conflict with the Hadoop requirements.

I was able to ssh into the box from my Mac desktop.  The next step will be to install Ambari and Hadoop, which I'll cover in my next post.

Saturday, October 10, 2020

Hadoop Reading Material

I'm starting to really get into my DS730 - Big Data: High Performance Computing class. I wanted to go beyond the instructor's material and picked up some additional reading material.

Hoping this will help me be successful in the weeks to come.