Hadoop - Using Node.js With HBase

While learning a bit about Node.js and D3, I wondered if there was a way to use those technologies to visualize HBase data.  Turns out, it's possible and it doesn't take much code.

Using information from a few other developers, I was able to build a simple working example and then turn it into a more robust Node.js and Express app with D3 visualization.  This post details the initial simple example using the Thrift API to connect to HBase.

All code can be found on GitHub at https://github.com/amick/node_hbase

REST vs Thrift

Most people seem to use the Thrift API because it's much faster.  With Thrift, you have to compile the Thrift bindings for the language you want to use.

REST is easier in that there are no classes or configs to compile.  The drawback is slower performance, because the REST API returns the schema along with the data.  With Thrift, only the bytes are returned and the compiled classes interpret them.
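To give a rough feel for the extra wrapping REST adds, here is a small sketch.  It assumes the JSON shape the HBase REST gateway uses for rows (a "Row" array whose keys, column names, and values are all base64 encoded, with the value stored under "$").  The sample data and the helper names are made up for illustration, and nothing here talks to a real server.

```javascript
// Illustrative only: builds a sample shaped like an HBase REST JSON row
// response, then decodes it.  The row, column, and value are invented.
function toBase64(text) {
  return Buffer.from(text, 'utf8').toString('base64');
}

function fromBase64(encoded) {
  return Buffer.from(encoded, 'base64').toString('utf8');
}

var sampleRestResponse = {
  Row: [{
    key: toBase64('row1'),
    Cell: [{
      column: toBase64('cf:name'),
      timestamp: 1400000000000,
      $: toBase64('alice')
    }]
  }]
};

// Decode every base64 field back into a readable string.
function decodeRestRows(response) {
  return response.Row.map(function (row) {
    return {
      key: fromBase64(row.key),
      cells: row.Cell.map(function (cell) {
        return {
          column: fromBase64(cell.column),
          timestamp: cell.timestamp,
          value: fromBase64(cell.$)
        };
      })
    };
  });
}

console.log(JSON.stringify(decodeRestRows(sampleRestResponse), null, 2));
```

Every cell carries its field names and an encoding step, which is part of the overhead that the compiled Thrift classes avoid.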

Helper Resources

My research into Node.js and HBase brought me to a post from DailyJS that had clear instructions for getting Thrift compiled and ready to run.  I ran into a few problems with the steps outlined, so I ended up using the code from another HBase Thrift post to get my Thrift connection to work.  The trick was using "transport: thrift.TBufferedTransport" instead of "transport: TFramedTransport".

With the TFramedTransport, I kept getting an error:

error: { [Error: read ECONNRESET] code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' }

As soon as I used the TBufferedTransport, everything worked.
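For context on why the transport choice matters: Thrift's framed transport prefixes every message with a 4-byte big-endian length, while the buffered transport writes the message bytes as they are.  When the client and server disagree about framing, the first bytes are misread, which is a plausible (though unconfirmed) explanation for the reset above.  A minimal sketch of the difference follows; the helper names are hypothetical and not part of the thrift module.

```javascript
// Hypothetical helpers that mimic how each transport lays bytes on the wire.
function bufferedBytes(message) {
  // Buffered transport: the message goes out unchanged.
  return Buffer.from(message);
}

function framedBytes(message) {
  // Framed transport: a 4-byte big-endian length, then the message.
  var header = Buffer.alloc(4);
  header.writeInt32BE(message.length, 0);
  return Buffer.concat([header, Buffer.from(message)]);
}

// First four bytes of a strict binary-protocol call message (version + type).
var message = Buffer.from([0x80, 0x01, 0x00, 0x01]);

console.log('buffered:', bufferedBytes(message).toString('hex'));
console.log('framed:  ', framedBytes(message).toString('hex'));
```

A server that expects unframed messages would try to read the length prefix as the start of a protocol header, so the two sides need to agree on one transport.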

Most of the commands listed below came from these helper posts.  I'm copying them here for completeness and flow, with a few changes to match my environment.

1. Install Thrift

The Thrift install instructions for Mac OS worked for the most part.  The Boost steps take 5-10 minutes to complete, so be patient.  The libevent install is very quick.  My first Thrift install, using "./configure --prefix=/usr/local/ --with-boost=/usr/local --with-libevent=/usr/local", resulted in some errors, and thrift was not available as a script.

After some searching, I found this page on the Apache wiki.  I didn't use the Boost install steps on this page since Boost had already installed successfully.  I did use the libevent and Thrift commands.  The Thrift commands still produced some errors, but the result was a thrift script that ran correctly.

2. Create Node Project

After Thrift is installed, the next step is to create a Node project named node_hbase:

mkdir node_hbase
cd node_hbase
npm init
npm install --save thrift

3. Generate HBase Thrift Classes

The HBase Thrift classes need to be generated inside of the node project.  The instructions in both of the above posts assume the HBase server is running on the local machine.  I was using the Hortonworks Sandbox setup with HBase.  I thought I could grab the JARs from that server and get access to the Hbase.thrift file.  That's not quite the case.

I ended up downloading the 0.96.2 HBase tarball from Apache.  After expanding it, the Hbase.thrift file is located in the hbase-0.96.2/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift directory.  I used that file to generate the Node.js artifacts, and everything worked well.

My command looked something like (your directory structure may be different):

thrift --gen js:node ../hbase-0.96.2/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

4. Connect to HBase

Now it's finally time to write some code.  I copied the code from the DailyJS post into a file named listTables.js

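// This section includes a fix from here:
// https://stackoverflow.com/questions/17415528/nodejs-hbase-thrift-weirdness/21141862#21141862
// Thrift names the generated files after Hbase.thrift, so the require paths
// use that casing; this matters on case-sensitive filesystems.
var thrift = require('thrift'),
  HBase = require('./gen-nodejs/Hbase.js'),
  HBaseTypes = require('./gen-nodejs/Hbase_types.js'),
  // TBufferedTransport is the fix; TFramedTransport produced ECONNRESET.
  connection = thrift.createConnection('localhost', 9090, {
    transport: thrift.TBufferedTransport,
    protocol: thrift.TBinaryProtocol
  });

// Once connected, ask HBase for its table names and print them.
connection.on('connect', function() {
  var client = thrift.createClient(HBase, connection);
  client.getTableNames(function(err, data) {
    if (err) {
      console.log('get table names error:', err);
    } else {
      console.log('hbase tables:', data);
    }
    // Close the connection so the script can exit.
    connection.end();
  });
});

// Report connection-level failures such as a refused or reset connection.
connection.on('error', function(err) {
  console.log('error:', err);
});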

Before running the code, be sure to start up HBase and the HBase Thrift service.  My Hortonworks Sandbox environment started both of these once I enabled HBase.

With HBase started, run:

node listTables

The output should list all of the tables defined in HBase.  Node.js is now talking to HBase.  Woohoo!!
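If the nested callbacks become awkward once this moves into a larger app, one option is to wrap the client calls in Promises.  The sketch below is only an illustration: callAsync is a hypothetical helper, and fakeClient stands in for the generated Thrift client so the example runs without HBase.

```javascript
// Hypothetical helper: calls client[method](...args, callback) and returns
// a Promise that settles with the callback's error or data.
function callAsync(client, method) {
  var args = Array.prototype.slice.call(arguments, 2);
  return new Promise(function (resolve, reject) {
    args.push(function (err, data) {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
    client[method].apply(client, args);
  });
}

// Stand-in for the generated client; the table names are invented.
var fakeClient = {
  getTableNames: function (callback) {
    callback(null, ['table_a', 'table_b']);
  }
};

callAsync(fakeClient, 'getTableNames')
  .then(function (tables) {
    console.log('hbase tables:', tables);
  })
  .catch(function (err) {
    console.log('get table names error:', err);
  });
```

With the real client from listTables.js, the call would look the same, with connection.end() moved into the final handler.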

5. More Operations

I wanted to go a little further, so I started playing around with how to get and put data into HBase.  This proved to be a little more difficult because I couldn't find the correct function calls for gets and puts.  After some digging, I found this post (not in English, but the code is readable) that had examples of gets and puts.  See the scan.js file in the GitHub repo for examples.
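As a rough sketch of what a put followed by a get can look like (this is not the code from scan.js): it assumes the Thrift1 calls mutateRow(tableName, row, mutations, attributes, callback) and getRow(tableName, row, attributes, callback), plus a Mutation type with column and value fields, as defined in the 0.96.2 Hbase.thrift.  Check your generated Hbase.js if the signatures differ.  The in-memory memoryClient and fakeTypes are stand-ins so the example runs without HBase.

```javascript
// Writes one cell, then reads the row back.  `types` is expected to expose
// a Mutation constructor, like the generated Hbase_types.js module.
function putThenGet(client, types, table, row, column, value, done) {
  var mutation = new types.Mutation({ column: column, value: value });
  // Assumed signature: mutateRow(table, row, mutations, attributes, callback)
  client.mutateRow(table, row, [mutation], null, function (putErr) {
    if (putErr) {
      return done(putErr);
    }
    // Assumed signature: getRow(table, row, attributes, callback)
    client.getRow(table, row, null, function (getErr, rows) {
      if (getErr) {
        return done(getErr);
      }
      done(null, rows);
    });
  });
}

// In-memory stand-ins (hypothetical) so the sketch runs on its own.
var fakeTypes = {
  Mutation: function (args) {
    this.column = args.column;
    this.value = args.value;
  }
};
var memoryStore = {};
var memoryClient = {
  mutateRow: function (table, row, mutations, attributes, callback) {
    memoryStore[row] = memoryStore[row] || {};
    mutations.forEach(function (m) {
      memoryStore[row][m.column] = { value: m.value };
    });
    callback(null);
  },
  getRow: function (table, row, attributes, callback) {
    var columns = memoryStore[row];
    callback(null, columns ? [{ row: row, columns: columns }] : []);
  }
};

putThenGet(memoryClient, fakeTypes, 'my_table', 'row1', 'cf:name', 'alice',
  function (err, rows) {
    if (err) {
      console.log('put/get error:', err);
    } else {
      console.log('row:', JSON.stringify(rows));
    }
  });
```

With a real connection, the call would pass the Thrift client and the HBaseTypes module from listTables.js in place of the stand-ins.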

The final hurdle I ran into was accessing HBase rows where the rowkey was a byte array.  Again, a little research led me to an example of creating an MD5 hash in JavaScript.  With this information in hand, I was able to build a byte array rowkey to work with gets and puts.  See the scanWithBytes.js file in GitHub for the code.
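A minimal sketch of building an MD5 byte-array rowkey is shown below.  Note that it uses Node's built-in crypto module rather than the JavaScript MD5 example mentioned above, and md5RowKey is a hypothetical helper name.  How the resulting Buffer should be handed to the Thrift client may depend on the version of the thrift module, so treat that part as something to verify in your own setup.

```javascript
var nodeCrypto = require('crypto');

// Returns the 16-byte MD5 digest of a string key as a Buffer.
function md5RowKey(naturalKey) {
  return nodeCrypto.createHash('md5').update(naturalKey, 'utf8').digest();
}

var rowKey = md5RowKey('hello');
console.log(rowKey.length);          // 16
console.log(rowKey.toString('hex')); // 5d41402abc4b2a76b9719d911017c592
```

The hex form is handy for logging, while the Buffer itself holds the raw bytes that make up the rowkey.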

Conclusion

Overall, it took me about 3 or 4 hours to get the Node.js code communicating with HBase.  Most of that time was spent researching the errors I ran into along the way.  The get and put code took another 4 hours, most of which I spent struggling to find the correct function calls.  Once I found the proper calls and the MD5 hash example, getting the code to run was easy.

Hopefully, with all of the information listed here, it will take you only a couple of hours to get all of this working.