Single-node pseudo-distributed cluster setup for Hadoop

I was new to Vagrant until recently, but after working with it, managing VirtualBox VMs has become much easier, especially if you create and destroy them frequently. Working with large datasets in Hadoop can be risky, as can any new technology you are experimenting with. Vagrant is a perfect tool for working without any risk of harming your local system, and managing several VMs in parallel becomes a cakewalk. Access to the VMs is CLI-based, but creating machines of different OS types is easy and you can destroy them at any time.
In a typical development environment, we use a single-node cluster to test our code. For best results, the host computer should have at least 8 GB of RAM and 500 GB of disk space. If you don't have such a machine, please upgrade your RAM and hard disk before working with Vagrant+Hadoop, since we may deal with larger datasets that need a significant amount of memory and disk space. In a real scenario, we might use Oracle VirtualBox, VMware or AWS with Vagrant. We are using only VirtualBox here, and our host machine in this demo is Ubuntu 12.04 LTS.

Installing VirtualBox and Vagrant


To install VirtualBox, visit: https://www.virtualbox.org/wiki/Downloads

Download the .deb package to a local directory, navigate to that location in a terminal and issue the following commands:
sudo dpkg -i /path/to/deb/file
sudo apt-get install -f

Alternatively, install directly with apt-get:
sudo apt-get update
sudo apt-get install virtualbox

For Vagrant: https://www.vagrantup.com/docs/
Installation packages are available for each OS; we are using a Linux-based host. To install using the package manager:
sudo apt-get install vagrant

This did not always install a recent release, but now the latest stable version gets installed (1.8.1 in my case). You can check your Vagrant version with "vagrant --version".

Once Vagrant is installed, create a local folder and navigate to it in the terminal. We can add a box using one of the following commands for the Ubuntu flavors of Linux, one of the most popular operating systems for Hadoop:

vagrant box add hashicorp/precise64   (Ubuntu 12.04)
vagrant box add ubuntu/trusty64       (Ubuntu 14.04)
vagrant box add ubuntu/xenial64       (Ubuntu 16.04)

You can search for a Vagrant box from the official website of Hashicorp Atlas: https://atlas.hashicorp.com/boxes/search

Once the Vagrant box is added, you can initialize a VM by issuing:
vagrant init ubuntu/trusty64
or, if you added a CentOS box:
vagrant init centos/7
A file named Vagrantfile will be created in the folder, holding the configuration of the VM you are about to create. We will change this file to allocate our desired RAM and to set up port forwarding.

You are now ready to start the VM using vagrant up 

You might notice some errors regarding the shared folder configuration. We need shared folders in our VMs to transfer data between the host and the guest machines; this will be set up later. But first, we need to customize the Vagrant box. You can check the current hardware configuration of your VM with the command "lshw", which shows the allocated RAM, disk space and so on.
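If you just want the memory and disk numbers, these standard utilities also work inside the guest (the exact output format varies a little across Ubuntu releases):

free -m
df -h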

Customizing the Vagrant box

The default disk size of these VM images is quite limited and might not be sufficient for testing with real data, so it is necessary to increase the hard disk capacity of the VM. Since Hadoop creates and uses many intermediate output files while executing MapReduce programs, we will need much more capacity than the actual size of the data files. For example, if we have 25 GB of data to be processed, we will need at least 40 GB in our Vagrant box.

The Vagrant box can be customized to have more than the default configured disk space with the following steps. First, we need to convert the VMDK disk to a VDI disk, which can be resized. Find the .vmdk file of the virtual machine image, navigate to its location, and then use the VBoxManage tool that comes with the VirtualBox installation:

VBoxManage clonehd <old_image.vmdk> <clone_image.vdi> --format vdi

Now we can easily resize this VDI disk, e.g. to 50 GB:

VBoxManage modifyhd <clone_image.vdi> --resize 51200

I referred to this blog for this tweak: https://tuhrig.de/resizing-vagrant-box-disk-space/

The last step is to point the VM at the new disk instead of the old one. We can do this by cloning the VDI back to the original VMDK disk, within a few clicks in VirtualBox, or from the command line as sketched below.

To customize the RAM, CPUs and port forwarding, we need to modify the Vagrantfile that was created by the "vagrant init" command; that comes right after the disk sketch.
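This is only a rough VBoxManage sketch for the disk swap, not a definitive recipe: the controller name "SATAController" and the port/device numbers are assumptions that may differ on your VM, so check them with "VBoxManage showvminfo <vm_name>" first.

VBoxManage storageattach <vm_name> --storagectl "SATAController" --port 0 --device 0 --type hdd --medium <clone_image.vdi>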

Go to the forwarded ports section and add the extra ports that will be used for monitoring jobs in Hadoop:

config.vm.network "forwarded_port", guest: 80, host: 8080
config.vm.network "forwarded_port", guest: 50070, host: 50070
config.vm.network "forwarded_port", guest: 50030, host: 50030
config.vm.network "forwarded_port", guest: 8088, host: 8088
config.vm.network "forwarded_port", guest: 19888, host: 19888

You should also use larger values for the RAM and increase the number of CPUs in the provider block:

config.vm.provider "virtualbox" do |vb|
  # Display the VirtualBox GUI when booting the machine
  # vb.gui = true
  # Customize the amount of memory and CPUs on the VM:
  vb.memory = "6144"
  vb.cpus = 3
end

So I had 6 GB of RAM and 3 CPUs for the lone VM I was running.

If you are creating several VMs, you can decide how much memory to allocate to each of them. Make sure to save some for your host machine!
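For reference, here is a minimal sketch of what the relevant parts of the Vagrantfile might look like once these changes are in place; the box name matches the one initialized earlier and the values are simply the ones used in this demo:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  # Ports for the Hadoop web consoles
  config.vm.network "forwarded_port", guest: 50070, host: 50070
  config.vm.network "forwarded_port", guest: 8088, host: 8088
  config.vm.network "forwarded_port", guest: 19888, host: 19888
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "6144"
    vb.cpus = 3
  end
end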

Once this is done, start the Vagrant box again with "vagrant up" and check the hardware configuration. Create a directory named "host" in your home directory on the guest to share with the host machine. Halt the VM and launch VirtualBox from the terminal or Dash Home. Choose the VM you created, go to the Shared Folders section and add the host folder you wish to share with the guest machine; call it "share" for our examples. Now, to enable shared folders, we need the Guest Additions. For that, go back to the terminal and run:
vagrant plugin install vagrant-vbguest

Next, start the Vagrant box again and issue the following commands inside the VM:


wget http://download.virtualbox.org/virtualbox/5.0.16/VBoxGuestAdditions_5.0.16.iso -P /tmp   (downloads the Guest Additions ISO from the official site into /tmp)

sudo mount -o loop /tmp/VBoxGuestAdditions_5.0.16.iso /mnt   (mounts the ISO under /mnt)

sudo sh -x /mnt/VBoxLinuxAdditions.run # --keep

sudo modprobe vboxsf   (this should not return any error messages)

sudo mount -t vboxsf -o uid=$UID,gid=$(id -g) share ~/host   (mounts the shared folder "share" onto ~/host)

While installing, you might see warnings saying that the Guest Additions version on the guest does not match the VirtualBox version on the host. Ideally they should be the same, but since the host and guest OS versions can differ, a small mismatch is usually still compatible and these warnings can generally be ignored.
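As a side note, once the Guest Additions are working, the shared folder can usually be declared directly in the Vagrantfile instead of being mounted by hand; the host path "./share" and guest path "/home/vagrant/host" below are assumptions matching the names used above:

config.vm.synced_folder "./share", "/home/vagrant/host"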

Installing Hadoop and inserting data into HDFS

All set. Now we need to install Hadoop, but before that we need to check the Java version. Usually Java is not present on a fresh box, so install Java 1.8, preferably the amd64 OpenJDK 8 build, because that is compatible with our Hadoop environment requirements. Also make sure that Java 9 is not on the classpath, because it can throw SASL (security) errors with our programs. If you have both 8 and 9 installed, you can check and switch between them with:
update-alternatives --config java
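If Java is missing altogether, installing OpenJDK 8 on Ubuntu usually comes down to something like the following; note that on older releases such as 12.04/14.04 the openjdk-8 packages may only be available through a PPA or backports:

sudo apt-get install openjdk-8-jdk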

Hadoop downloads can be found here: http://redrockdigimark.com/apachemirror/hadoop/common/

Untar the .gz file in the local shared directory; you can then copy it into the guest:
tar xvzf file.tar.gz
I am using Hadoop 2.7.x and 2.5.x in my VMs.

Set all the environment variables as follows:

export JAVA_HOME=/usr
export HADOOP_HOME=~/hadoop-2.7.3 (Replace with your version and path)
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
These are needed for our programs to compile and run.
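Note that these exports only last for the current shell session. To make them persistent, you can append the same lines to ~/.bashrc; the path below is the one used in this demo and may differ on your machine:

echo 'export HADOOP_HOME=~/hadoop-2.7.3' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc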

Please check that Java and Hadoop are available in your VM:

java -version
hadoop version
These two should return the correct versions.

Install ssh if not present.
sudo apt-get install ssh

ssh localhost should log you into another prompt. 

Then set up passwordless ssh with these commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
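If "ssh localhost" still prompts for a password after this, the usual culprit is file permissions on the key material; tightening them generally fixes it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys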

Modify the hadoop-env.sh file under $HADOOP_HOME/etc/hadoop to set JAVA_HOME and HADOOP_HOME:

# The java implementation to use.
export JAVA_HOME=/usr
export HADOOP_HOME=~/hadoop-2.7.3

Now we need to configure the XML files core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml, all located in the etc/hadoop subdirectory. Of the three modes (standalone, pseudo-distributed and fully distributed), we use the second one for development and testing. For that we modify these XMLs with the following property values:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
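One small caveat: the Hadoop 2.x tarballs usually ship only an etc/hadoop/mapred-site.xml.template and no mapred-site.xml, so create the latter from the template before editing it:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml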


Make sure that all the ports are working fine.

sudo netstat -ntlp
should list the listening ports in the guest. If any of the Hadoop ports are blocked, open them with iptables:
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 50070 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 19888 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 8080 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 50030 -j ACCEPT
sudo iptables -I INPUT -p tcp -s 0.0.0.0/0 --dport 80 -j ACCEPT

Next, format the filesystem:
hdfs namenode -format

Go to the sbin directory under the Hadoop home (or rely on the PATH set earlier) and execute the shell scripts that start the daemons:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

To check that the daemons are running fine, use:
jps
This will list all the running Java processes. We should see the NameNode, the DataNode and the JobHistoryServer, among others.
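A typical listing on a healthy pseudo-distributed setup looks roughly like this (the process IDs will of course differ):

2481 NameNode
2615 DataNode
2804 SecondaryNameNode
2980 ResourceManager
3101 NodeManager
3350 JobHistoryServer
3412 Jps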

To check the web consoles, you can use:

wget http://localhost:50070/   (for the NameNode; this downloads the index.html of the NameNode console)
Thanks to the port forwarding, the same consoles can also be checked from the browser on the host:
http://localhost:8088/ is the ResourceManager and http://localhost:19888/ is the JobHistory server.
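When you are done working, the matching stop scripts shut the daemons down cleanly:

stop-yarn.sh
stop-dfs.sh
mr-jobhistory-daemon.sh stop historyserver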

Inserting data into HDFS

Before running any Hadoop command, always make sure that the daemons are running. Otherwise, you will get a connection error such as "Connection refused".

To create a user directory in HDFS:
hadoop fs -mkdir -p /user/$USER/data

Then check with:
hadoop fs -ls /user/$USER
which should list the folder just created.

To copy large datasets into HDFS,
hadoop fs -copyFromLocal ~/host/data.txt  /user/$USER/data
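To confirm that the copy worked and to see how much space the file occupies in HDFS, a couple of quick checks:

hadoop fs -ls /user/$USER/data
hadoop fs -du -h /user/$USER/data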

You have inserted data into HDFS. Congrats! The next blog post will show how to execute MapReduce programs and troubleshoot them. Happy Hadoop time with Vagrant!
