Hadoop - Running Pig From Java

My Hadoop learning curve recently took me through Pig land.  I learned the language and how to use it to process data within Hadoop, and I was able to write some short Pig scripts and test out filters and splits.  The one roadblock I ran into was executing my Pig script from within Java, which is my most comfortable language.  I like having the option of running my Hadoop processing from a Java program where I can set up multiple configs, chain steps together, etc.  Eventually, I'll tie it all together using Hadoop tools, but it's easier with Java at this time.

The Pig Wiki has some good information and an example of using PigServer to run Pig commands via Java.  The example builds the commands within the Java code rather than using a Pig script file.

That's easily changed by having the Java code register a script instead of registering queries.  The Pig script file must be on the local classpath.

Java Code

PigServer pigServer = new PigServer(ExecType.MAPREDUCE);
pigServer.setBatchOn();
InputStream is = this.getClass().getClassLoader().getResourceAsStream(pigScript);
pigServer.registerScript(is);
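One thing worth guarding against: getResourceAsStream returns null rather than throwing when the resource is not on the classpath, so a missing script can surface later as a confusing error.  Below is a minimal sketch of a loader that fails early instead.  The class name ScriptLoader and the file name wordcount.pig are hypothetical, and the sketch uses only the JDK so it runs without Pig installed.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

final class ScriptLoader {

    /** Opens a classpath resource, throwing if it cannot be found. */
    static InputStream open(String resourceName) throws FileNotFoundException {
        InputStream in = ScriptLoader.class.getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            // getResourceAsStream signals "not found" with null, not an exception
            throw new FileNotFoundException("Not on classpath: " + resourceName);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // "wordcount.pig" is a placeholder name for this example
        try (InputStream in = open("wordcount.pig")) {
            System.out.println("Found script, ready to hand to registerScript");
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The stream returned by open could then be passed to pigServer.registerScript in place of the raw getResourceAsStream call.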

When running this code, I kept getting errors about not having correct Hadoop settings.  A search on Stack Overflow led me to try passing the Hadoop configuration settings via Java Properties.
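For reference, the Properties approach looks roughly like the sketch below.  The exact settings I tried are not reproduced here, so treat the details as assumptions: fs.default.name and mapred.job.tracker are the MR1-era property names, and the host and port values are placeholders for a pseudo-distributed node.  The helper class HadoopSettings is hypothetical.

```java
import java.util.Properties;

final class HadoopSettings {

    /** Builds the two connection settings an MR1 client typically needs. */
    static Properties forCluster(String nameNodeUri, String jobTracker) {
        Properties props = new Properties();
        props.setProperty("fs.default.name", nameNodeUri);   // HDFS NameNode URI
        props.setProperty("mapred.job.tracker", jobTracker); // JobTracker host:port
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the addresses of your own cluster
        Properties props = forCluster("hdfs://localhost:8020", "localhost:8021");
        props.list(System.out);

        // With Pig on the classpath, the properties can be passed to PigServer:
        // PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
    }
}
```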

After a few more searches and some digging through the errors, I realized that the PigServer class looks for mapred-site.xml and core-site.xml on the Java classpath.  The XML below is the full contents of the local config files I set up for PigServer.


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
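Since a missing config file was the root of my errors, a quick check before creating the PigServer can save some digging.  This is a minimal sketch that reports which of the expected files cannot be found on the classpath.  The class name ClasspathCheck is hypothetical, and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

final class ClasspathCheck {

    /** Returns the subset of the given resource names that are not on the classpath. */
    static List<String> missing(String... resourceNames) {
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        List<String> notFound = new ArrayList<>();
        for (String name : resourceNames) {
            // getResource returns null when the resource cannot be located
            if (loader.getResource(name) == null) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing("core-site.xml", "mapred-site.xml");
        if (notFound.isEmpty()) {
            System.out.println("Both Hadoop config files are on the classpath");
        } else {
            System.out.println("Missing from classpath: " + notFound);
        }
    }
}
```

In a Maven project, placing the two files under src/main/resources is one common way to get them onto the classpath.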

With these XML files in place, I got past the hurdle of defining my Hadoop properties.  When the code ran, however, I received errors about YARN configs and MapReduce 2 settings related to the hadoop-client jar, even though I had not configured anything to use YARN.

My Hadoop instance was a Cloudera CDH 4.2.0 pseudo-distributed node, which uses its own set of jars.  I verified that I was using the MR1 versions of the jars listed in the Maven repo documentation.

After doing more digging and searching, I found a reference to a Cloudera MR1 version of hadoop-client.  As soon as I used that version in pom.xml, the Java code ran without errors.

    <version>2.0.0-mr1-cdh4.2.0</version>

After figuring this out, it makes sense that there would be different hadoop-client jars for MR1 and MR2.