Using Pig To Load HBase Tables

HBase is cool.  Pig is cool.  It should be easy to put cool and cool together, right?  I tried, and it took a while before I could get cool talking to cool.  Thank goodness it finally worked - and it turns out to be simple to get a Pig script to load data into an HBase table.  My dev environment is the HDP Sandbox 2.0 with HBase 0.96 and Pig 0.12.0.

The first tutorial article I found came from the Hortonworks Sandbox documentation and had great step-by-step instructions, but it was written for Sandbox 1.3 with older versions of Pig and HBase.  At the time (I should have known better), I simply followed the instructions, hoping for the best.

The creation of the HBase table through HCatalog worked well.


CREATE TABLE meters (id STRING, site STRING)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
    'hbase.table.name' = 'meters',
    'hbase.columns.mapping' = 'd:site',
    'hcat.hbase.output.bulkMode' = 'true'
);

The HCatalog command used to run this DDL is "hcat -f filename.ddl". The DDL file needs to be on the Sandbox server's local filesystem, not in HDFS. With the Sandbox setup, the hcat command should run without any issues.
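Behind the scenes, the storage handler creates the HBase table named in hbase.table.name with the column family from the mapping. If you want to check (or create) the table by hand, the rough HBase shell equivalent would be something like this - a sketch assuming only the 'd' family from the DDL above:

```text
create 'meters', 'd'
describe 'meters'
```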

With the table created, it was time to copy my CSV data file to HDFS so that it could be loaded.

hadoop fs -copyFromLocal meters.csv /tmp/aamick/meters.csv
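For reference, the CSV just needs two comma-separated fields matching the (id, site) schema. The rows below are made-up placeholder values, not my actual data:

```text
m001,building-a
m002,building-b
m003,building-c
```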

The Sandbox documentation page then has you create a 2-line Pig script and run it on the Sandbox machine.

A = LOAD '/tmp/aamick/meters.csv' USING PigStorage(',') AS (id:chararray, site:chararray);
STORE A INTO 'hbase://meters' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:site');
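One thing worth understanding before debugging: HBaseStorage uses the first field of each tuple as the HBase row key and maps the remaining fields positionally onto the column descriptors passed to it ('d:site' here). A quick Python sketch of that mapping, purely for illustration - this is not Pig's actual code:

```python
def hbase_storage_rows(tuples, columns):
    """Simulate HBaseStorage's mapping: the first field of each tuple
    becomes the row key; the remaining fields line up positionally
    with the column descriptors given to the storer."""
    rows = {}
    for t in tuples:
        key, values = t[0], t[1:]
        rows[key] = dict(zip(columns, values))
    return rows

# Two hypothetical tuples loaded from the CSV, one column descriptor.
data = [("m001", "building-a"), ("m002", "building-b")]
print(hbase_storage_rows(data, ["d:site"]))
# {'m001': {'d:site': 'building-a'}, 'm002': {'d:site': 'building-b'}}
```

So with the (id, site) schema, id becomes the row key and site lands in the d:site cell.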

With the basic Pig script, I encountered the following errors.  Not a very helpful error message, if you ask me.

2014-05-13 13:33:51,377 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1400009795673_0002

2014-05-13 13:33:51,377 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A

2014-05-13 13:33:51,377 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[1,4],A[-1,-1] C:  R:

2014-05-13 13:33:51,473 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

2014-05-13 13:34:16,580 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.

2014-05-13 13:34:16,580 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1400009795673_0002 has failed! Stop running all dependent jobs

2014-05-13 13:34:16,580 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2014-05-13 13:34:16,706 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!

2014-05-13 13:34:16,708 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.2.0.2.0.6.0-76 0.12.0.2.0.6.0-76 root 2014-05-13 13:33:30 2014-05-13 13:34:16 UNKNOWN

Failed!

Failed Jobs:

JobId Alias Feature Message Outputs

job_1400009795673_0002 A MAP_ONLY Message: Job failed! hbase://meters,

Input(s):

Failed to read data from "/tmp/aamick/meters.csv"

Output(s):

Failed to produce result in "hbase://meters"

Counters:

Total records written : 0

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:
job_1400009795673_0002

2014-05-13 13:34:16,708 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!

2014-05-13 13:34:16,717 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message

2014-05-13 13:34:16,717 [main] ERROR org.apache.pig.tools.grunt.GruntParser - org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

After a lot of trial and error, forum searches, and frustration, I finally came upon the solution.  Pig needs a little help with where to find the HBase jars.  For some reason, Pig doesn't pick up the jars from the classpath.  Once the jars are registered in the Pig script, everything runs smoothly, and HBase has data in the meters table!

As far as which jars to register, I used the same jars that worked when running HBase code from Java (common, client, server, protocol, and htrace).  I also had to register the zookeeper and guava jars based on steps outlined in a post about HBase, Pig, and JRuby.  (Note: I added all of the hbase and htrace jars without testing exactly which ones were being used by the Pig script.)

REGISTER /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-server-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-protocol-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/htrace-core-2.01.jar
REGISTER /usr/lib/hbase/lib/zookeeper.jar
REGISTER /usr/lib/hbase/lib/guava-12.0.1.jar

A = LOAD '/tmp/aamick/meters.csv' USING PigStorage(',') AS (id:chararray, site:chararray);
STORE A INTO 'hbase://meters' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:site');

To run the Pig script, run "pig meters.pig" from the Sandbox server.
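If you'd rather not hard-code REGISTER statements in every script, Pig can also pick up extra jars from the command line via the pig.additional.jars property (a colon-separated list). A sketch, assuming the same Sandbox jar locations used above:

```text
# build a colon-separated list of the needed jars and hand it to Pig
HBASE_JARS=$(echo /usr/lib/hbase/lib/hbase-{common,client,server,protocol}-*.jar \
             /usr/lib/hbase/lib/htrace-core-*.jar \
             /usr/lib/hbase/lib/zookeeper.jar \
             /usr/lib/hbase/lib/guava-*.jar | tr ' ' ':')
pig -Dpig.additional.jars=$HBASE_JARS meters.pig
```

I stuck with REGISTER statements since they keep the script self-documenting, but the property approach is handy when jar versions change.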

Now the script runs successfully!

2014-05-13 13:49:14,718 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1400009795673_0005

2014-05-13 13:49:14,718 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A

2014-05-13 13:49:14,718 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[9,4],A[-1,-1] C:  R:

2014-05-13 13:49:14,891 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

2014-05-13 13:50:02,752 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2014-05-13 13:50:07,709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2014-05-13 13:50:07,733 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.2.0.2.0.6.0-76 0.12.0.2.0.6.0-76 root 2014-05-13 13:48:54 2014-05-13 13:50:07 UNKNOWN

Success!

Job Stats (time in seconds):

JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs

job_1400009795673_0005 1 0 21 21 21 21 n/a n/a n/a n/a A MAP_ONLY hbase://meters,

Input(s):

Successfully read 6 records (479 bytes) from: "/tmp/aamick/meters.csv"

Output(s):

Successfully stored 6 records in: "hbase://meters"

Counters:

Total records written : 6

Total bytes written : 0

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_1400009795673_0005

2014-05-13 13:50:08,519 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

The Pig script is still just 2 lines to load and store the data into HBase; the only extra parts are the jar REGISTER statements. Yes, this was frustrating to figure out, but going forward, I like the idea of 2 lines of Pig versus 50+ lines of Java code to load an HBase table.
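As a final sanity check, you can scan the table from the HBase shell; each CSV row should come back as one row key with a d:site cell (the exact output depends on your data):

```text
hbase shell
scan 'meters'
```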