Hadoop Archive

I’ve been running Cloudera’s Hadoop offering on Ubuntu since December 2013, and I thought that after six months it was time to record some of my experiences. First, my setup has ranged from 5 to 7 nodes on three different hypervisor platforms: XCP, Hyper-V and VMware. Each node is provisioned with one 3.4GHz core and 4GB of memory. The first five nodes ran on VMware and Hyper-V; the 6th and 7th were added on XCP. My configuration requires that data exist on three different nodes (a replication factor of three). I ran a daily cron job that performed a SELECT COUNT(*) via Hive and recorded the number of rows and the time taken to perform the query. The row count has ranged from 9 million to close to 40 million. The source data is NetFlow […]
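For the curious, here is a minimal sketch in Python of what such a daily job could look like. The table name netflow and the log path are assumptions for illustration, not the exact script I ran:

#!/usr/bin/env python3
# Minimal sketch of the daily row-count job described above.
# The table name "netflow" and the log path are assumptions.
import subprocess
import time

LOG = "/var/log/hive-rowcount.log"  # hypothetical log location

start = time.time()
rows = subprocess.check_output(
    ["hive", "-S", "-e", "SELECT COUNT(*) FROM netflow;"]).decode().strip()
elapsed = time.time() - start

with open(LOG, "a") as f:
    f.write("%s rows=%s seconds=%.1f\n"
            % (time.strftime("%Y-%m-%d"), rows, elapsed))

A crontab entry such as 0 6 * * * /usr/local/bin/rowcount.py runs it once a day.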

Read More...

In previous posts I’ve written about how to install Hadoop on Ubuntu in under 20 minutes, how to configure NetFlow export into Hadoop, and how to add multiple nodes to your Hadoop cluster. In this post, I’ll outline how to start querying NetFlow data via Hive so it can be analyzed in Excel. The expectation is that you’ve followed the previous posts in this series so that your current Hadoop installation is in a predictable state. Here are the foundational things you need to know to accomplish this task: I highly suggest shutting down your NetFlow collector in advance, since parts of this procedure can be complicated by new files arriving while the metastore is mid-transition. Hive’s metadata store […]
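As a rough sketch of the end goal, something like the following Python snippet can pull a Hive aggregate into a CSV file that Excel opens directly. The table and column names (netflow, srcip, bytes) are assumptions for illustration:

#!/usr/bin/env python3
# Sketch: run a Hive aggregate and write it as CSV for Excel.
# Table and column names (netflow, srcip, bytes) are assumptions.
import csv
import subprocess

query = ("SELECT srcip, SUM(bytes) AS total_bytes FROM netflow "
         "GROUP BY srcip ORDER BY total_bytes DESC LIMIT 100;")
out = subprocess.check_output(["hive", "-S", "-e", query]).decode()

with open("top-talkers.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["srcip", "total_bytes"])
    for line in out.splitlines():
        w.writerow(line.split("\t"))  # the Hive CLI emits tab-separated rows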

Read More...

In the first article here, I walked through importing NetFlow data into a single Hadoop instance (a pseudonode) and mentioned a progression of the project to add multiple nodes. Distributed storage and distributed processing of data are ultimately the benefit of using Hadoop/HDFS, so let’s expand the project and add one or more nodes. Additional nodes (datanodes in Hadoop terminology; the namenode is the master) can be installed the same way as the master. Not all of the packages and processes are required on a datanode, but it does no harm if they are installed, so a new node can follow the same procedure as the first. I’ve documented the quickest install procedure I’ve seen here. You can follow this process to […]
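Once a new datanode is up, a quick way to confirm it registered with the namenode is to scan the dfsadmin report. A small sketch, with the caveat that the exact report wording varies between Hadoop versions:

#!/usr/bin/env python3
# Sketch: confirm a freshly added datanode has registered with the
# namenode by scanning the dfsadmin report. The line prefixes below
# vary between Hadoop versions, so treat them as a guess.
import subprocess

report = subprocess.check_output(
    ["sudo", "-u", "hdfs", "hdfs", "dfsadmin", "-report"]).decode()

for line in report.splitlines():
    if line.startswith(("Datanodes available", "Live datanodes", "Name:")):
        print(line)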

Read More...

I’ve seen a bunch of articles that provide instructions on how to install Hadoop on Ubuntu, like this one. Let’s be honest: that sounds way too hard. And it IS! It is no secret that Hadoop is “hard to install”; you’ll find references to that in a lot of places. However, I’ve installed Hadoop before and I KNOW I haven’t had to do all of those steps. But I did install from packages. So, I set out to document an easier process. You can get a full Hadoop pseudonode up and running from scratch in 20 minutes, and here is how (my host hardware is an 8-core, 32GB, 2TB setup; your hardware affects the time). I’ve done this on both Hyper-V (Server 2012) and VMware ESXi […]
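When the install finishes, a quick smoke test is to probe the standard web UIs. A small sketch, assuming the classic MRv1-era default ports (50070 for the NameNode, 50030 for the JobTracker); adjust for your distribution:

#!/usr/bin/env python3
# Sketch: smoke-test a fresh pseudonode by probing its web UIs.
# Ports are the classic defaults and may differ on your version.
from urllib.request import urlopen

for name, port in [("NameNode", 50070), ("JobTracker", 50030)]:
    try:
        urlopen("http://localhost:%d/" % port, timeout=5)
        print("%s UI responding on port %d" % (name, port))
    except OSError as exc:
        print("%s UI not reachable on port %d: %s" % (name, port, exc))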

Read More...

It is hard to ignore all of the hype around Hadoop and Big Data these days. Most infrastructure engineers tend to focus on how to build highly available, highly scalable networks, and I’m no exception. However, it is still important to me to keep up with and implement projects on popular trends, directly infrastructure-related or not, especially when I can apply the project in some way to the infrastructure. With that, here is my first Hadoop project, which uses NetFlow, nfdump (nfcapd), Hadoop/HDFS and Hive. The end result is being able to query historical NetFlow data from a Hadoop data store. When you think about it, Hadoop is a great repository for NetFlow data written by nfdump. Hadoop handles extremely large data […]
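To give a feel for the glue involved, here is a sketch of the collection-to-HDFS step: convert completed nfcapd capture files to CSV with nfdump and push them into HDFS where Hive can see them. Both directory paths are hypothetical placeholders:

#!/usr/bin/env python3
# Sketch: convert finished nfcapd capture files to CSV with nfdump
# and ship them into HDFS for Hive. Paths are placeholders.
import glob
import os
import subprocess

LOCAL = "/var/netflow"                  # hypothetical nfcapd output dir
HDFS = "/user/hive/warehouse/netflow"   # hypothetical warehouse path

for path in glob.glob(os.path.join(LOCAL, "nfcapd.*")):
    if "current" in path:
        continue  # skip the file nfcapd is still writing
    csv_path = path + ".csv"
    with open(csv_path, "w") as out:
        subprocess.check_call(["nfdump", "-r", path, "-o", "csv"], stdout=out)
    subprocess.check_call(["hdfs", "dfs", "-put", csv_path, HDFS])
    os.remove(csv_path)  # keep the local spool small once safely in HDFS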

Read More...

