Hadoop for network engineers Part 4 – analyzing flow data with MS Excel

In previous posts I’ve written about how to install Hadoop on Ubuntu in under 20 minutes, how to configure NetFlow export into Hadoop and how to add multiple nodes to your Hadoop cluster.

In this post, I’ll outline how to start querying NetFlow data via Hive so it can be analyzed in Excel. The expectation is that you’ve followed the previous posts in this series, so your current Hadoop installation is in a predictable state.

Here are the foundational things you need to know to accomplish this task:

  • I highly suggest shutting down your NetFlow collector in advance. Parts of this procedure can be complicated by new files arriving while the metastore is mid-migration
  • Hive’s default metadata store (Derby) does not support multiple concurrent sessions, so you need to change the underlying metastore to MySQL on the namenode. If you were actively using Hive previously, you were probably already fighting lock files and the like, so moving to MySQL is a good idea anyway
  • ZooKeeper Server is required on all of your nodes
  • HiveServer2 needs to be installed and configured on your namenode
  • You’ll likely have to re-create your table, but this shouldn’t impact your data at all (the data itself lives in HDFS)

Step #1: Install ZooKeeper on your datanodes and namenode

This is the easiest task and can be accomplished using packages. The steps are below:

  • apt-get install zookeeper-server
  • echo "1" > /var/lib/zookeeper/myid (each node needs its own unique numeric ID; ZooKeeper will not accept a non-numeric value)
  • service zookeeper-server init
  • /etc/init.d/zookeeper-server start
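Once it’s running, you can spot-check each node with ZooKeeper’s built-in "ruok" command (this assumes nc/netcat is available and ZooKeeper is listening on its default port, 2181):

# a healthy server answers "imok"
echo ruok | nc localhost 2181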

Step #2: Install MySQL packages and configure MySQL

The guide I used for this step is here. I’ve created my own version below.

First, install the mysql-server package with apt-get install mysql-server. Ubuntu’s package manager should prompt you to set a root password during installation.

After it is installed, create the metastore database and its table structure as shown below:

$ mysql -u root -p

Enter password:

mysql> CREATE DATABASE metastore;

mysql> USE metastore;

mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;

Create a Hive user (replace metastorehost below with the hostname or IP address that your Hive services will connect from):

mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';

mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';

mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';

mysql> FLUSH PRIVILEGES;

mysql> quit;
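Before moving on, it’s worth confirming the hive user can actually log in and see the new schema (substitute the host you granted access from):

$ mysql -u hive -p -h <hostname of your mysql server> metastore -e "SHOW TABLES;"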

Install the Java MySQL client:

apt-get install libmysql-java

Make sure the library is in Hive’s library path:

cp /usr/share/java/mysql-connector-java-5.1.16.jar /usr/lib/hive/lib/
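The version number in the jar name will vary with your distribution. On Ubuntu, libmysql-java also installs an unversioned symlink, so linking that into Hive’s lib directory instead avoids hard-coding the version:

ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar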

Step #3: Install packages on your namenode

Not all of these commands may be required, but running them will cause no harm:

  • apt-get install hive-server2
  • apt-get install thrift
  • apt-get install hive-metastore
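Once configured (next step) and started, HiveServer2 listens on TCP port 10000 by default and the metastore service on 9083, which gives you a quick sanity check:

netstat -lnt | egrep '10000|9083'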

Step #4: Configure Hive on your namenode

The only Hive configuration required is in your hive-site.xml file in /etc/hive/conf. Example hive-site.xml contents are below (the properties need to go inside the existing <configuration> stanza). The NOSASL setting is particularly important: it’s a step I didn’t see reflected in any other documentation, but it is the critical piece that allows Excel to connect.

Personally, I would also add the ‘hive’ user to the ‘hadoop’ group on your namenode; without it I experienced a number of file permission issues.
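On Ubuntu that’s a single command:

# -a appends the hadoop group to hive's existing supplementary groups
usermod -a -G hadoop hive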

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<hostname of your mysql server>/metastore</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value><insert username></value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><insert password></value>
</property>

<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>

<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>

<property>
<name>hive.metastore.uris</name>
<value>thrift://<your namenode server/mysql server>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>

<property>
<name>hive.support.concurrency</name>
<description>Enable Hive’s Table Lock Manager Service</description>
<value>true</value>
</property>

<property>
<name>hive.zookeeper.quorum</name>
<description>Zookeeper quorum used by Hive’s Table Lock Manager</description>
<value><comma separated list of all the servers you put ZooKeeper on></value>
</property>

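After saving hive-site.xml, restart the Hive services so they pick up the new metastore configuration:

service hive-metastore restart
service hive-server2 restart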

Step #5: Test using command line utilities

Before testing with Excel, I would make sure your data is still accessible via command line tools. You can simply run ‘hive’ as the HDFS user.

Make sure you’ve started hive-server2 and MySQL, and that all of your Hadoop daemons are running.
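You can also exercise HiveServer2 directly with beeline (shipped alongside HiveServer2 in these packages), which uses the same Thrift path Excel will. Note the auth=noSasl parameter, which matches the NOSASL setting above:

beeline -u "jdbc:hive2://<your namenode>:10000/default;auth=noSasl" -e "show tables;"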

Commands to test from the Hive CLI are "show databases", "show tables", and "select * from netflow limit 10". You will likely have to re-create the table pointing at the HDFS warehouse location; at least I did.
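If you do have to re-create it, an external table pointed at the existing warehouse directory leaves the underlying files untouched. Treat the following as a sketch only: the column names, types, delimiter, and location are assumptions and must match whatever your flow loader from the earlier posts actually writes.

hive> CREATE EXTERNAL TABLE netflow (
  srcip STRING,
  dstip STRING,
  protocol INT,
  tcp_flags INT,
  packets BIGINT,
  bytes BIGINT,
  starttime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/netflow';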

You may also have to delete any stale lock files (*.lck) in /var/lib/hive/metastore/metastore_db; these are leftovers from the old embedded Derby metastore.
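Once you’re sure nothing is still using the embedded Derby store, they are safe to remove:

rm /var/lib/hive/metastore/metastore_db/*.lck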

If you’re not seeing data, you’ll want to troubleshoot all of the new components, including MySQL ("show processlist" is a good MySQL command to reference). Don’t be too concerned, though: your data resides in HDFS, and HDFS was not affected by any of these changes.

Step #6: Query data into Excel

You’ll need to install an ODBC driver for Cloudera’s Hive/Thrift instance which you can find here.

They have an installation guide here which worked fine and didn’t require any special instructions. Below is a screenshot of my driver configuration; note that, per the NOSASL setting, I don’t have authentication enabled.

[Screenshot: ODBC driver configuration]

Here are two really good articles on querying data from Excel. These steps worked fine for me and did not require any modification.

Step #7: Analyze away

Here’s a really simple view of the data using a PivotTable. The example shows bytes and packets by IP protocol.

[Screenshot: PivotTable of bytes and packets by IP protocol]

And here is the same view by TCP flags:

[Screenshot: PivotTable of bytes and packets by TCP flags]

Debugging

For troubleshooting Hive and Thrift, your most relevant log files will be found in /var/log/hive.
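With the CDH packaging used in this series, the files you’ll usually want are the HiveServer2 and metastore logs (exact file names can vary by version):

tail -f /var/log/hive/hive-server2.log
tail -f /var/log/hive/hive-metastore.log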

General advice on Excel

Excel is not the greatest tool for large-scale data analysis. I was able to use a LIKE clause to restrict the NetFlow data to a single day’s worth, about 500,000 records. That was workable for basic PivotTable analysis on an 8-core box with 32 GB of RAM, but I doubt you’d want to analyze much more data than that. I did not see support within Excel for Hive’s LIMIT clause to cap the number of rows returned, even though Hive itself supports it.
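For reference, the shape of query that worked for me is sketched below; the starttime column and its string format are assumptions you’d adapt to your own table:

SELECT * FROM netflow WHERE starttime LIKE '2014-01-15%';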

Excel also seems to query Hive multiple times before returning the data set. For me it was sometimes as many as three full jobs.
