6 months of running Cloudera Hadoop plus performance benchmarks

I’ve been running Cloudera’s Hadoop offering on Ubuntu since December 2013, and after 6 months I thought it was time to record some of my experiences.

First, my setup has ranged from five to seven nodes on three different hypervisor platforms – XCP, Hyper-V and VMware. Each node is provisioned with one 3.4 GHz core and 4 GB of memory. The first five nodes ran on VMware and Hyper-V; the sixth and seventh were added on XCP. My configuration requires that every block of data exist on three different nodes.
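Requiring that data exist on three nodes is just HDFS’s replication factor. For reference, in a stock install this is controlled by the dfs.replication property in hdfs-site.xml – the snippet below is a generic fragment, not my exact configuration:

```xml
<!-- hdfs-site.xml: keep three copies of every block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```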

I ran a daily cron job that performed a SELECT COUNT(*) through Hive and recorded the number of rows along with the time taken to perform the query. The row count ranged from 9 million to close to 40 million. The source data is netflow data in CSV format, chunked into hundreds of thousands of files.
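For the curious, the job itself was trivial. Here is a sketch of the kind of script cron invoked – the table name, log path and schedule are illustrative, not my exact setup, and it obviously needs a live cluster to run:

```shell
#!/bin/sh
# Illustrative benchmark script -- table name and paths are assumptions.
# Runs a full COUNT(*) through Hive and appends the date, the row count
# and the wall-clock seconds to a CSV log.
LOG=/var/log/hive-benchmark.csv

START=$(date +%s)
ROWS=$(hive -S -e 'SELECT COUNT(*) FROM netflow;')
END=$(date +%s)

echo "$(date +%F),$ROWS,$((END - START))" >> "$LOG"
```

The matching crontab entry would be something like 0 2 * * * /usr/local/bin/hive-benchmark.sh to fire once a night.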

Much is said about Big Data these days, and the combination of HDFS to distribute storage and MapReduce to distribute jobs sounds very compelling. But one has to wonder how much of Big Data’s popularity is powered by Hadoop versus traditional RDBMSs or other data stores. Hadoop is of course just a tool in the toolbox – not a silver bullet for every problem.

From my experience – even as a tool in the toolbox – it seems “not ready for prime time”.

Over the 180 days that I ran Cloudera’s Hadoop, my cron job completed successfully only about 70 times. The longest stretch of consecutive successful runs was 33 days. Failures had a variety of causes: Hive locking issues, MapReduce job-completion issues, and one total HDFS corruption after which I had to rebuild and lost almost half of my data. Maintaining the cluster required constant attention; I had to intervene and fix issues about a dozen times.

It is fair to say I didn’t do much to optimize the system – I mostly used the default configuration.

It is also fair to say that the performance statistics do show gains from adding more nodes. For whatever reason, the more nodes that are added, the faster each node is apparently able to process its share of the work. Here are some of the performance data points.

There were two major events during the 6 months worth pointing out – the massive data store corruption and the expansion from five to seven nodes. Both happened around the same time.

This first graph shows the total amount of time the query took to run. Y is the time taken, X is the number of rows.

The next graph shows rows processed per second on the SELECT statement. Y is the number of rows processed per second, X is the number of rows.

The next graph shows rows processed per second on the SELECT per node. Y is rows per second per node and X is the total number of rows.
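For anyone who wants to reproduce these per-node numbers from the daily log: assuming the log is in date,rows,seconds CSV form (the format and function name are my own illustration), both throughput metrics fall out of a single awk pass:

```shell
# Illustrative helper -- the date,rows,seconds log format is an assumption.
# Reads benchmark lines on stdin and prints rows/sec and rows/sec/node,
# with the node count passed as the first argument.
compute_throughput() {
  nodes=$1
  awk -F, -v n="$nodes" '{
    rps = $2 / $3
    printf "%s rows/sec=%.1f rows/sec/node=%.1f\n", $1, rps, rps / n
  }'
}

# Example with made-up numbers: 9,000,000 rows counted in 300s on 5 nodes
echo "2014-05-01,9000000,300" | compute_throughput 5
```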
