Equity markets are subject to serious human risks…

As I was reading about big data's role in stock markets, I wondered whether any technology could really benefit equity investors. I read some articles about high-frequency trading, algorithmic trading, news sentiment analysis and so on. Big data and machine learning have revolutionized so many sectors, and finance is one that generates volumes of data in streams, so no wonder it falls under their sphere of influence. But I am somewhat skeptical about big data's role in equity markets, not from a technical perspective but from the point of view of an investor. I have invested in the Indian equity market for nearly two years now, and the amount was one that nobody with no experience in stock markets and no job would ever dare! The ups and downs were too large for the risk to be worth taking. However, that leap has given me an experience that will forever help me understand at least a part of the politics and the serious risks involved in the world of finance.

The equity market is crucial for a nation's fast economic growth, and so far I have seen that all major economies are interdependent and connected. Any disaster in big economies like the USA, Europe or China comes banging down on the emerging economies. The Chinese currency devaluation was, in recent times, one of the worst financial crises that equity markets around the globe have seen. And to my surprise, similar "highly negative" events happened over and over again. Another was "Brexit", but its impact was minimal compared to the previous one, and the market recovered significantly the same day. I have seen the index literally sink from 100+ points up to 100+ points down in a matter of seconds, a difference not large enough to trigger a circuit breaker, but it was like a free fall. I still wonder whether all individual investors or fund-managing firms could really sell their stocks so fast at one particular time, say 2:35 pm, when the graph became somewhat regular and moved in streaks rather than the normal irregular pattern, followed by a straight line… downwards! This happened several times in 2015. If you look at the graph of the NIFTY index below, what the market gained over a year was completely nullified by the end of 2015, exactly a year after the bull market began in early 2014.


What surprises me more is that those "hammering" impacts and news items came up during the negative cycle of the market, that is, when investors were selling their stocks for normal profit booking. That made even long-term investors sell, which ultimately brought the index down to unexpectedly low levels! In early 2014 I saw many experts, stock brokers and firms in the news talking about the Indian bull market touching 9000+ levels by the end of 2015, and by now it should have been close to 10,000. But it reached the 9200 level only a few days back. Stocks that touched new highs at the beginning of 2015 came tumbling down to levels so low that they still haven't recovered. The best example is ICICI Bank, which hasn't touched its expected 300+ level for the past several months. But I am still hopeful and invested in that stock.

When there was an interest rate cut for the first time in India announced by the RBI governor, the market cheered, but the second time the indices sank!

If stock markets had never been hacked, I would have believed those arguments about bull markets correcting 15-20%. But given the way some days went in the market, I strongly believe it is not immune to manipulation. While algorithmic trading brings automation to the stock market, it can also breach normal behavior and set off panic selling, ultimately resulting in a loss for investors, traders and the nation as a whole. Any negative sentiment in the stock market, or an imminent downward trend, drives away foreign investments, which play a very important role in the growth of every sector and company. A stop loss is the mark a stock trader sets for triggering a sell action. Once that level is crossed and traders have sold their stocks at a loss, it's a golden opportunity for a "stalking" investor to buy in bulk!

I don't know about the political aspects of the yuan devaluation decision, but what I understand is that it was taken at the cost of thousands of equity investors around the globe, and I really feel for the Chinese traders and investors who were affected the most.

2014 was the time of the general election in India, and there was a sudden surge in the indices on growing positive sentiment. But from around mid-2015, that gain was completely lost.

Financial experts and institution heads say various things about any positive or negative behavior in the markets. Sometimes the market corrects because it was bound to, but not always. No matter what experts, institutions or anyone else in the media says, how a particular stock, index or fund will perform can never be predicted, only guessed! The views are sometimes biased too, so traders need to be very cautious about buy and sell calls and suggested stop-loss values for a stock. And if you are thinking of investing in equity, remember that there are many people out there who will try to manage your money for their own profit! No predictive technology can save you from a loss if the political scenario is bad and serious hackers are attempting to breach the financial markets.

If techies today take some interest in this otherwise risky and boring business, then perhaps some of these challenges can be overcome. But until something tangible happens, it is an investor's own responsibility to analyze the patterns and remain alert.



What’s next with WebRTC?

Web Real-Time Communication is said to have found a way to replace all plugins and applications for real-time audio/video and textual messaging. Say "bye bye" to Flash! All we need is a JavaScript-enabled browser and we are good to go. In fact, most of us using Google Hangouts don't know that it is shifting to WebRTC. Other biggies like Facebook and Skype are on their way. Several companies working on video chat, monitoring systems, gaming and live interaction are implementing WebRTC-based solutions. The other day, while exploring some current projects in this area, I came across a fantastic article by OnSIP that, in short, lists out all that is making waves in this field.

I see WebRTC applications in various sectors that can make communication very convenient and versatile, mainly because it can be easily implemented and "plugged" into other products to extend their features. Even setting aside the networking advantages of voice and data over the internet, the software products today that involve real-time communication in any form are bound to get a makeover with this technology, or perhaps a complete transformation!

Web based communication

Be it collaboration, conferencing or live streaming, everything can be transitioned to a WebRTC-based solution. More importantly, I feel what is lacking in such tools is a recording facility. That would be really helpful for workers and students who wish to refer back to what was discussed in a meeting. Imagine a website that not only streams live sessions but also delivers the recorded video to your cloud storage account or a local folder immediately after the session gets over…

Improving Customer experience

It is obvious that contact centers will change over time, and customer complaints and grievances will be resolved in a way that is better than how they are handled today. Some enterprise software companies have already leveraged the power of WebRTC to improve customer interactions. Even for smaller businesses, individual consultants and practitioners, WebRTC could solve the problem of being next to their clients.

Embedding WebRTC into other niche applications

Gaming, software development tools and interviewing platforms can become better by adding WebRTC to their interfaces. When two parties are interacting over some real-time data exchange, WebRTC-based video or audio channels can enhance the experience: two people playing an online game, or a software professional evaluating somebody's code face-to-face. Think of a content-sharing platform that is accessed by multiple users in real time and also reserves space for face-to-face video chat.

WebRTC for safety…

I have seen many applications using IoT to prevent road accidents. But I also came across an article stating that capturing video in real time and analyzing the state of the driver and passengers can help prevent accidents and mishaps. Since most smartphones have a front and a rear camera, it is possible to capture video of both the road and the people inside a vehicle and ensure that nothing is abnormal. Violent movements, drowsiness and road conditions can be detected with AI. A smartphone can thus easily be converted into a surveillance camera for monitoring and recording audio/video streams.

These diverse applications do tempt many technical professionals to study this field. I am sure many people will come up with some great applications of this technology as they get to know it better technically.

From my experience, it comprises mostly networking and web development concepts and is indeed revolutionary. This technology is still maturing and a lot of JS APIs are being written.

Just go to the links below to get a feel for WebRTC.

Understanding Hadoop and Running a sample MapReduce program in Java

Okay, in my last post I wrote about all that we need to do to get our environment ready for some serious development with Hadoop. If you haven’t yet read that post, please do so. There’s a lot to be done to actually start working with Hadoop using Vagrant.

To summarize what was described there:

  1. If you wish to try out Hadoop and MapReduce on your system, first ensure that you have enough RAM and disk space: at least 8GB RAM with an i3 processor (an i5 or i7 supports higher RAM limits of 16GB/24GB) and a 500GB hard disk. The more the better: you will be able to create more VMs and run more programs, faster, all at the same time!
  2. Install VirtualBox, Vagrant, and Hadoop on your vagrant box.
  3. Create the shared folder for transferring data to and from the guest machine.
  4. Get the VM ready for pseudo distributed architecture of Hadoop with configurations.
  5. Insert data into HDFS, Hadoop's own distributed file system.

If you are done with these steps, then you are ready to get more knowledge about this technology while practically trying it out on your computer. That’s the best way to learn without getting bored even before you start!

So, you must already know a bit about this technology that is ruling the data segment in IT right now. It is well suited to administering semi-structured and unstructured data with localized processing. It follows a batch-processing approach and works on the entire dataset, rather than on just a part of it as an RDBMS does. So, is it better than relational databases? The answer is no. It has gained popularity because its approach is so different and solves what is relevant to the current scenario and the coming days of data management. It is just another tool that solves a different kind of problem. Relational database management systems work on structured data that has a predefined schema; they can do what Hadoop does, but with far less efficiency for this kind of workload. Hadoop is what we need to analyze data from streaming sources like IoT devices, multimedia, search engines and healthcare systems. It is not interactive; it writes data once but reads it several times. Updating data in HDFS can be difficult, but it is possible. To blend in RDBMS-like features, many Hadoop sub-products are out there in the market: you must have heard of Apache Spark, Pig, Hive, HBase etc. We'll discuss these later; they are related products that process a variety of data types in different ways. You can find more details about the background of Hadoop in the book Hadoop: The Definitive Guide by Tom White.

Now, imagine that you get an assignment from a science department that hands you loads and loads of data collected over time and asks you to calculate some statistical information it needs. You have not a few MBs of data but several GBs of data in text form. What would you do? They haven't given you any schema, and the data is all alphanumeric rows. Say it's weather data and you need to calculate the maximum temperature.

SQL can be very slow here, because the data first has to be broken down into discrete fields and loaded before anything can be calculated. Scripts can be written, but they too will be efficient only up to a point.

Doug Cutting, the creator of Hadoop and the Lucene project, came up with an approach that divides the whole problem-solving process into two steps: mapping and reducing. The data to be processed is structured as keys and values: the raw rows are converted into key-value pairs, and the result is then calculated over batches of records until the final answer is obtained.

The raw data may look like this:

0043011990999991950051518004…9999999N9-00111+99999999999… 0043012650999991949032412004…0500001N9+01111+99999999999…
The framework presents this input to the map function as key-value pairs, where the key is the offset of each line in the file:

(0, 0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)
(212, 0043011990999991950051518004…9999999N9-00111+99999999999…)
(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)

Here, the first column is the byte offset of each line within the file; Hadoop uses it as the key for the map function.

After the map function extracts the year and the air temperature, we get:

(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)

These are the (year, temperature) pairs that were recorded.

Now, the task of calculating the Max. temperature is done by the reducer.

Initially, it will group the temperatures:

(1949, [111, 78])
(1950, [0, 22, −11])

and finally calculate and give the output as:

(1949, 111)
(1950, 22)
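Before we write the Hadoop version, the grouping and reducing above can be sketched in a few lines of plain Java. This is a toy, single-machine illustration of my own (the class name `MaxTempSketch` is made up; it is not part of the Hadoop API), just to show what the shuffle and reduce phases compute:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy, single-machine sketch of the flow above: take the mapped
// (year, temperature) pairs, group them by year, and reduce each
// group to its maximum value.
public class MaxTempSketch {

    static Map<String, Integer> reduceMax(List<? extends Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> maxByYear = new TreeMap<>(); // sorted by year
        for (Map.Entry<String, Integer> p : pairs) {
            // merge() keeps the larger of the stored and incoming temperatures
            maxByYear.merge(p.getKey(), p.getValue(), Math::max);
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        // The five (year, temperature) pairs from the worked example
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<>("1950", 0),
                new AbstractMap.SimpleEntry<>("1950", 22),
                new AbstractMap.SimpleEntry<>("1950", -11),
                new AbstractMap.SimpleEntry<>("1949", 111),
                new AbstractMap.SimpleEntry<>("1949", 78));
        System.out.println(reduceMax(pairs)); // prints {1949=111, 1950=22}
    }
}
```

In real Hadoop, the grouping is done for us by the framework's shuffle, so the reducer only ever sees one year at a time along with its list of temperatures.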

For our programs, you can download and use the datasets from the GitHub repo below:

Writing our first MapReduce program in Java

The Mapper:


// The Writable types (LongWritable, Text, IntWritable) are specific to Hadoop
// and we will talk about them in detail later.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) { // quality check
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

The Reducer for max. temperature:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}

The job for running the program:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

We will discuss the line-by-line semantics of the above three programs in the next post. But for now, we will attempt to run this in hadoop and check the output. 

Once you have created a local folder in your guest machine and copied these programs there, compile them using the following command:

hadoop com.sun.tools.javac.Main *.java

If you get any errors, check your Java version (it should be 8 or below), make sure the PATH variable contains the Java bin folder, and make sure HADOOP_CLASSPATH points to the tools.jar of Java 8. Please refer to the previous post.

If that succeeds, the .class files are created, and then we need to create the jar out of the class files: jar cvf mt.jar *.class

Now you can insert the latest weather data into HDFS. Note that you can download it onto your host computer and copy it to your "share" folder, which is in sync with the "host" folder of the guest.

hadoop fs -copyFromLocal  ~/host/1990.txt  /user/$USER/data  

To check that it was inserted,

hadoop fs -ls /user/$USER/data

Next, we will run the MapReduce program on our data.

hadoop jar mt.jar MaxTemperature /user/$USER/data /user/$USER/output/1

$USER is your username. To check the output, list the files as,

hadoop fs -ls /user/$USER/output/1

You will see two files, part-r-00000 and _SUCCESS; the first one contains the result, something like this:


The console output shows the percentage-wise progress of the map and reduce functions and finally the job statistics: tasks, counters, reads/writes, memory and space used, and, most importantly, the time taken. Even for GBs of data it might be just a few minutes.

If you are lucky, then everything might go smoothly. But there are high chances that you will get errors and exceptions. I have listed those below:

  1. SASL errors: check your Java version; it shouldn't be 9. OpenJDK Java 8 is the recommended version.
  2. Make sure that all the servers are running. Use jps to check that the Datanode, Namenode and History Server are running before inserting data or running the programs. If any of these are stopped, you may need to restart them, or the hadoop commands will throw exceptions. All of them need to run, not just one or two.
  3. ChecksumException: Checksum Error: this is due to data corruption. If inserting the data into HDFS again does not help, you may need to destroy the VM and recreate it 😦 Sorry!!
  4. Connection closed: this is a very common scenario. You may need to halt the VM and restart it: exit from the VM, then "vagrant halt" and "vagrant up".
  5. hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException: File /tmp/hadoop-yarn/staging/ubuntu/.staging/job_1486087445115_0003/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. This means your HDFS disk is full and you need more space; use another VM in that case. At any point in time, you can check the root file system usage with the hadoop fs -df command.

So, we ran a sample mapreduce program to calculate the maximum temperature.

The next blog will describe the structure of the program and its actual working in the cluster in detail. Please go through the following link for more details:

Happy Reading!








Getting Started with Hadoop on Vagrant

Single node Pseudo-distributed cluster setup for Hadoop

I was new to Vagrant until recently. But after having worked with it, managing VMs with VirtualBox has become much easier, especially if you are creating and destroying them frequently. Working with large datasets in Hadoop can be risky, as with any new technology you are experimenting with. Vagrant is a perfect tool for working without any risk of harming your local system, and managing or working with several VMs in parallel becomes a cakewalk. Access to the VMs is CLI-based, but creating boxes of different OS types is easy and you can destroy them at any time.
In a common development environment, we use a single-node cluster to test our code. For best results, at least 8GB RAM and a 500GB hard disk are expected on the host computer. If you don't have such a machine, please upgrade your RAM and hard disk to work smoothly with Vagrant+Hadoop, since we might deal with larger datasets that need a significant amount of memory and disk space. In a real scenario, we might use Oracle VirtualBox, VMware or AWS with Vagrant. Here we are using only VirtualBox, and our host machine in this demo runs Ubuntu 12.04 LTS.

Installing VirtualBox and Vagrant

To install VirtualBox, visit:

Download the package at a local directory and navigate to that location from terminal. Issue the following commands:
sudo dpkg -i /path/to/deb/file
sudo apt-get install -f
To install with apt-get:
sudo apt-get update
sudo apt-get install virtualbox

For Vagrant:
The various installation packages are available for each OS. We are using a Linux based box. 
To install using a package manager, use the following commands:
sudo apt-get install vagrant 

This did not work previously, but now the latest stable version gets installed: 1.8.1 in my case. You can check your Vagrant version using "vagrant --version".

Once Vagrant is installed, create a local folder and navigate to it from the terminal. We can add a box using the following commands for Ubuntu flavors of Linux, which is one of the most popular OSs for Hadoop:

vagrant box add hashicorp/precise64 (ubuntu 12)
vagrant box add hashicorp/trusty64  (ubuntu 14)
vagrant box add hashicorp/xenial64  (ubuntu 16)

You can search for a Vagrant box from the official website of Hashicorp Atlas:

Once the vagrant box is added you can initialize or create a VM by issuing the command:
vagrant init ubuntu/trusty64
or, vagrant init centos/7  
if you added a CentOS box. A file named Vagrantfile will be created in that folder, containing the configuration of the VM you are about to create. We will change this file to allocate the desired RAM and to use port forwarding.

You are now ready to start the VM using vagrant up 

You might notice some errors regarding the shared folders' configuration. We need shared folders in our VMs to transfer data between the host and guest machines; this will be set up later. But first, we need to customize the Vagrant box. You can check the current hardware configuration of your VM using the command "lshw", which shows the RAM and disk space allocated.

Customizing the Vagrant box

The default sizes of these VM images are very limited and might not be sufficient for testing with actual data. Hence it is necessary to increase the hard disk capacity of these VMs. Since Hadoop creates and uses many intermediate output files while executing the MapReduce programs, we will need much more capacity than the actual size of the data files. So if we have, say, 25GB of data to be processed, we will need at least 40GB in our Vagrant box.

The Vagrant box can be customized to have more than the default configured disk space with the following steps. First, we need to convert the VMDK disk to a VDI disk which can be resized. Search for the .vmdk file for the Virtual Machine image. Navigate to that location. Then use the VBoxManage tool which comes with the VirtualBox installation:

VBoxManage clonehd <old_image.vmdk> <clone_image.vdi> --format vdi

Now we can easily resize this VDI disk, e.g. to 50 GB:

VBoxManage modifyhd <clone_image.vdi> --resize 51200

I referred to this blog for this tweak:

The last step is simply to use the new disk instead of the old one. We can do this by cloning the VDI disk back to the original VMDK disk, or within a few clicks in VirtualBox. To customize the RAM and CPUs and set up port forwarding, we need to modify the Vagrantfile that was created by the "vagrant init" command.

Go to the forwarded ports section and add the extra ports to be used for monitoring jobs in Hadoop:

config.vm.network "forwarded_port", guest: 80, host: 8080
config.vm.network "forwarded_port", guest: 50070, host: 50070
config.vm.network "forwarded_port", guest: 50030, host: 50030
config.vm.network "forwarded_port", guest: 8088, host: 8088
config.vm.network "forwarded_port", guest: 19888, host: 19888

You should use larger values for the RAM and increase the cpus:

config.vm.provider "virtualbox" do |vb|
  # Display the VirtualBox GUI when booting the machine
  # vb.gui = true
  # Customize the amount of memory and the number of CPUs on the VM:
  vb.memory = "6144"
  vb.cpus = 3
end

So I had 6GB RAM and 3 cpus for the lone VM I had. 

If you are creating several VMs, you can decide how much memory to allocate to each of them. Make sure to save some for your host machine!

Once this is done, start the Vagrant box again with "vagrant up" and check the hardware configuration. Create a local directory "host" in your home directory to share with the host machine. Halt the VM and launch VirtualBox from the terminal or Dash Home. Choose the VM you created and go to the Shared Folders section. Add the folder you wish to share with the guest machine; call it "share" for our examples. Now, to enable shared folders, we need Guest Additions. For that, go to the terminal again and use this command:
vagrant plugin install vagrant-vbguest

Next start the Vagrant box again and in the VM command line, issue the following commands inside your VM:

wget -P /tmp (This is to download from the official website)

sudo mount -o loop /tmp/VBoxGuestAdditions_5.0.16.iso /mnt (Copy to mnt folder)

sudo sh -x /mnt/VBoxLinuxAdditions.run --keep  (Run the installer from the mounted ISO)

sudo modprobe vboxsf  (This should not return any error messages)

sudo mount -t vboxsf -o uid=$UID,gid=$(id -g) share ~/host (to link the folders)

While installing, you might see errors saying the Guest Additions versions on the host and guest machines don't match. Ideally they should be the same, but the OS versions of the host and guest may differ; it is compatible most of the time.

Installing Hadoop and inserting data into HDFS

All set. Now we need to install Hadoop. But before that, we need to check the Java version: usually Java is not preinstalled, so install Java 1.8, preferably amd64 OpenJDK 8, because that is compatible with our Hadoop environment requirements. Also make sure that Java 9 is not in the classpath, because it might throw SASL (security) errors with our programs. If you have both 8 and 9 installed, you can verify which one is active with:
update-alternatives --config java

Hadoop downloads can be found here:

Untar the .gz file in the local shared directory with tar xvzf file.tar.gz; you can then copy it to the guest. I am using Hadoop 2.7.x and 2.5.x in my VMs.

Set all the environment variables as follows:

export JAVA_HOME=/usr
export HADOOP_HOME=~/hadoop-2.7.3 (Replace with your version and path)
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar
These are needed for our programs to compile and run.

Please check that Java and Hadoop are available in your VM:

java -version
hadoop version
These two should return the correct versions.

Install ssh if not present.
sudo apt-get install ssh

ssh localhost should log you into another prompt. 

Then you need these commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Modify the hadoop-env.sh file under HADOOP_HOME/etc/hadoop to set JAVA_HOME and HADOOP_HOME.

# The java implementation to use.
export JAVA_HOME=/usr
export HADOOP_HOME=~/hadoop-2.7.3

Now, we need to configure the XML files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. These files are all located in the etc/hadoop subdirectory. Of the three modes (standalone, pseudo-distributed and fully distributed) we use the second one for development and testing. For that we need to modify these XMLs with different property values:

<?xml version="1.0"?>
<!-- core-site.xml -->

<?xml version="1.0"?>
<!-- hdfs-site.xml -->

<?xml version="1.0"?>
<!-- mapred-site.xml -->

<?xml version="1.0"?>
<!-- yarn-site.xml -->

Make sure that all the ports are working fine.

sudo netstat -ntlp

This should return the listening ports in the guest. Otherwise, execute the following:
sudo iptables -I INPUT -p tcp --dport 50070 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 19888 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 50030 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 80 -j ACCEPT

Next, to format the filesystem,
hdfs namenode -format

Go to the sbin directory under your Hadoop home and execute the shell scripts to start the servers: start-dfs.sh, start-yarn.sh, and mr-jobhistory-daemon.sh start historyserver.

To check that the servers are running fine, use jps.
This will return all the running servers. We should have the namenode, datanode and historyserver, apart from others.

To check the web consoles, you can use:

wget http://localhost:50070/ (for the namenode). It will download the index.html of the NameNode console.
Similarly, it can be checked from the browser of the Host. 
http://localhost:8088/ is for resource manager and http://localhost:19888/ for historyserver.

Inserting data into HDFS

Before running any hadoop command, always make sure that the servers are running. Otherwise, you will get a "Connection closed" error message.

To create a user directory in hdfs,
hadoop fs -mkdir -p /user/$USER/data

Then check with 
hadoop fs -ls /user/$USER   and this should return the folder created.

To copy large datasets into HDFS,
hadoop fs -copyFromLocal ~/host/data.txt  /user/$USER/data

You have inserted data into HDFS. Congrats! Next blog will show how to execute MapReduce programs and troubleshoot them. Happy Hadoop time with Vagrant!