Single-node pseudo-distributed cluster setup for Hadoop
Installing VirtualBox and Vagrant
To install VirtualBox, download the .deb package for your distribution from https://www.virtualbox.org/wiki/Downloads and install it:
sudo dpkg -i /path/to/deb/file
sudo apt-get install -f
Alternatively, install it from the Ubuntu repositories:
sudo apt-get update
sudo apt-get install virtualbox
For Vagrant, see: https://www.vagrantup.com/docs/
Installation packages are available for each OS. We are using a Linux-based host:
sudo apt-get install vagrant
Installing from the repositories was not working previously, but now the latest stable version gets installed (1.8.1 in my case). You can check your Vagrant version using "vagrant --version".
Once Vagrant is installed, create a local folder and navigate to it from the terminal. We can add a box using one of the following commands for Ubuntu, one of the most popular OSs for Hadoop:
vagrant box add hashicorp/precise64 (ubuntu 12)
vagrant box add hashicorp/trusty64 (ubuntu 14)
vagrant box add hashicorp/xenial64 (ubuntu 16)
You can search for a Vagrant box from the official website of Hashicorp Atlas: https://atlas.hashicorp.com/boxes/search
Once the vagrant box is added you can initialize or create a VM by issuing the command:
vagrant init ubuntu/trusty64
or, vagrant init centos/7
if you added a CentOS box. A file named Vagrantfile, holding the configuration of the VM you are about to create, will be generated in that folder. We will change this file later to allocate our desired RAM and to set up port forwarding.
You are now ready to start the VM using vagrant up
You might notice some errors regarding the shared folders' configuration. We need shared folders in our VMs to transfer data between the host and guest machines; this will be set up later. But first, we need to customize the Vagrant box. You can check the current hardware configuration of your VM with the command "lshw", which shows the allocated RAM, disk space, etc.
Customizing the Vagrant box
The default disk sizes of these VM images are very limited and might not be sufficient for testing with actual data, so it is necessary to increase the hard disk capacity of these VMs. Since Hadoop creates and uses many intermediate output files while executing MapReduce programs, we will need much more capacity than the actual size of the data files. For example, if we have 25 GB of data to be processed, we will need at least 40 GB in our Vagrant box.
The Vagrant box can be customized to have more than the default configured disk space with the following steps. First, we need to convert the VMDK disk to a VDI disk which can be resized. Search for the .vmdk file for the Virtual Machine image. Navigate to that location. Then use the VBoxManage tool which comes with the VirtualBox installation:
VBoxManage clonehd <old_image.vmdk> <clone_image.vdi> --format vdi
Now we can easily resize this VDI disk, e.g. to 50 GB:
VBoxManage modifyhd <clone_image.vdi> --resize 51200
I referred to this blog for this tweak: https://tuhrig.de/resizing-vagrant-box-disk-space/
The last step is to use the new disk instead of the old one. We can do this by cloning the VDI disk back to the original VMDK format (e.g. VBoxManage clonehd <clone_image.vdi> <new_image.vmdk> --format vmdk), or with a few clicks in VirtualBox by attaching the new disk in the VM's storage settings. To customize the RAM and CPUs, and for port forwarding, we need to modify the Vagrantfile that was created by the "vagrant init" command.
Go to forward ports section and add extra ports to be used for monitoring the jobs in Hadoop:
config.vm.network "forwarded_port", guest: 80, host: 8080
config.vm.network "forwarded_port", guest: 50070, host: 50070
config.vm.network "forwarded_port", guest: 50030, host: 50030
config.vm.network "forwarded_port", guest: 8088, host: 8088
config.vm.network "forwarded_port", guest: 19888, host: 19888
Then increase the RAM and the number of CPUs in the provider section:
config.vm.provider "virtualbox" do |vb|
  # Display the VirtualBox GUI when booting the machine
  # vb.gui = true
  # Customize the amount of memory and the number of CPUs on the VM:
  vb.memory = "6144"
  vb.cpus = 3
end
So I had 6 GB RAM and 3 CPUs for the lone VM I had.
If you are creating several VMs, you can decide how much memory to allocate to each of them. Make sure to save some for your host machine!
Once this is done, start the Vagrant box again with "vagrant up" and check the hardware configuration. Create a local directory "host" in your guest's home directory for sharing files with the host machine. Halt the VM and launch VirtualBox from the terminal or Dash Home. Choose the VM you created and go to the Shared Folders section. Add the host folder you wish to share with the guest machine; call it "share" for our examples. Now, to enable shared folders, we need Guest Additions. For that, go to the terminal again and use this command:
vagrant plugin install vagrant-vbguest
Next, start the Vagrant box again and issue the following commands inside your VM:
wget http://download.virtualbox.org/virtualbox/5.0.16/VBoxGuestAdditions_5.0.16.iso -P /tmp (download the Guest Additions ISO from the official website; match the version to your VirtualBox installation)
sudo mount -o loop /tmp/VBoxGuestAdditions_5.0.16.iso /mnt (mount the ISO at /mnt)
sudo sh -x /mnt/VBoxLinuxAdditions.run # --keep
sudo modprobe vboxsf (this should not return any error messages)
sudo mount -t vboxsf -o uid=$UID,gid=$(id -g) share ~/host (mount the shared folder "share" at ~/host)
While installing, you might see errors saying the Guest Additions versions on the host and guest machines don't match. Ideally they should be the same, but since the host and guest OS versions can differ, minor mismatches are usually still compatible.
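Alternatively, Vagrant can manage the shared folder itself, avoiding the manual mount. A minimal sketch for the Vagrantfile, assuming the host-side folder sits next to the Vagrantfile as "share":

```
# Inside the Vagrant.configure block: sync the host folder "./share"
# (relative to the Vagrantfile) to /home/vagrant/host in the guest.
config.vm.synced_folder "./share", "/home/vagrant/host"
```

This still requires working Guest Additions in the guest, which is why we installed the vagrant-vbguest plugin above.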
Installing Hadoop and inserting data into HDFS
All set. Now we need to install Hadoop. But before that, we need to check the Java version. Java is usually not preinstalled, so install Java 1.8, preferably amd64 OpenJDK 8, because that is compatible with our Hadoop environment requirements. Also make sure that Java 9 is not the default Java, because it might throw some SASL (security) errors with our programs. If you have both 8 and 9 installed, you can select the default with:
update-alternatives --config java
Hadoop downloads can be found here: http://redrockdigimark.com/apachemirror/hadoop/common/
Untar the .tar.gz file in the local shared directory, from where you can copy it into the guest: tar xvzf file.tar.gz. I am using Hadoop 2.7.x and 2.5.x in my VMs.
export HADOOP_HOME=~/hadoop-2.7.3 (Replace with your version and path)
These are needed for our programs to compile and run.
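These exports can go into ~/.bashrc so they persist across logins. A sketch, assuming Hadoop was extracted to ~/hadoop-2.7.3 and OpenJDK 8 is at the default Ubuntu amd64 location (adjust both paths to your setup):

```shell
# Environment for Hadoop; append to ~/.bashrc.
export HADOOP_HOME=$HOME/hadoop-2.7.3               # path to the extracted tarball
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64  # OpenJDK 8 on Ubuntu amd64
# Put the hadoop binaries and the daemon scripts on the PATH:
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```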
Please check that Java and Hadoop are available in your VM:
java -version
hadoop version
These two should return the correct versions.
Install ssh if not present.
sudo apt-get install ssh
ssh localhost should log you into another prompt.
Then you need these commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Modify the hadoop-env.sh file under $HADOOP_HOME/etc/hadoop to set JAVA_HOME and HADOOP_HOME.
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Now, we need to configure the XML files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. These files are all located in the etc/hadoop subdirectory. Out of the three modes – Standalone, Pseudo-distributed and Fully distributed – we use the second one for development and testing. For that we need to modify these XMLs with different property values:
<!– core-site.xml –>
<!– hdfs-site.xml –>
<!– mapred-site.xml –>
<!– yarn-site.xml –>
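The four files above each need only a couple of properties for pseudo-distributed mode. A sketch following the official Hadoop single-node setup guide (hdfs://localhost:9000 is the documented default NameNode address; adjust to your setup):

```
<!-- core-site.xml: where clients find the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can hold only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service for MapReduce -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```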
Make sure that all the ports are working fine.
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 50070 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 19888 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 8080 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 50030 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 80 -j ACCEPT
Next, to format the filesystem,
hdfs namenode -format
Go to the sbin directory under Hadoop home and execute the shell scripts for starting the servers:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
To check if the servers are running fine, use:
jps
This will return all the running server processes. We should have the NameNode, DataNode and JobHistoryServer, apart from others.
To check the web consoles, you can use:
wget http://localhost:50070/ (for the namenode). It will download the index.html of the NameNode console.
Similarly, the consoles can be checked from the host's browser, thanks to the port forwarding we configured.
http://localhost:8088/ is for resource manager and http://localhost:19888/ for historyserver.
Inserting data into HDFS
To create a user directory in hdfs,
hadoop fs -mkdir -p /user/$USER/data
To copy large datasets into HDFS,
hadoop fs -copyFromLocal ~/host/data.txt /user/$USER/data
You can verify the copy with "hadoop fs -ls /user/$USER/data".
You have inserted data into HDFS. Congrats! Next blog will show how to execute MapReduce programs and troubleshoot them. Happy Hadoop time with Vagrant!