Wait a minute! I thought Hive and Hadoop were for "big data".
Yes and no. The term "big data" is a silly term, in my opinion. In this post, I want to walk through leveraging the benefits of a Hadoop/Hive environment to work with regular, everyday datasets. It's not big, extremely complicated, or cumbersome.
Why would you use Hive for something like that?
Because it's easier, more flexible, and faster to implement!
The sample data set
Let's look at something that a lot of companies run into. A data set provided by a third party vendor in a CSV format. Maybe it's some sensor data they monitor, or competitor analysis, or industry trends.
These files typically have lots of columns. Let's go with a data set that's 100 columns wide and has 150,000 rows. In a CSV format, this file is approximately 100MB. And to make it interesting, the vendor changes columns and adds columns every few months.
Sure, this data can be loaded into Excel as a first pass. However, Excel quickly bogs down with a file this large and every user has to load the full data set even if they only want to work on a small portion.
Using A Relational Database
For next steps, a typical corporate IT approach would be to load that data into a relational database using a tool like Information Power Center to do the data movement.
What does this process look like?
- Define a table based on the CSV format
- Load the CSV file definition and the target relational table definition into Informatica
- Build a mapping to read the flat file and write it to the target table
- Build a workflow to run the newly created mapping
- Test and deploy
Let's call that a total of about 4-8 hours of work. But what happens when the vendor changes the data format?
- Table DDL has to be updated and re-run
- Informatica file and table definitions have to be updated
- Mappings have to be updated
- Testing and deployment
That's another 2-4 hours of work just to add a few columns. That's a lot of time to add a column. Not to mention a database, ETL tool, and developers are probably on different teams at most companies.
This data set will never be considered "big data" by the industry terms. Can it be processed faster or easier by "big data" tools.
Yes. Of course, or I wouldn't be writing this.
Hive is the natural first choice for this data. It gives us the same SQL query access that a relational database gives. What does the load process look like for Hive?
- Define a table based on the CSV format. (This will always be step one for structured data regardless of the system)
- Copy the CSV file to HDFS
That's it. No imports of definitions, no other systems, no development work. Two very simple steps and the work is done. Maybe a total of 2 hours if you have to manually type out the DDL.
When new columns are added, the DDL for the table can be updated and the columns are immediately available for querying. Because of Hive's schema-on-read approach, the new files can be placed next to the existing files even if the new files have additional columns. There is no need to reload the historical data when the DDL changes (this only works if columns are appended and not inserted in the middle of the file).
Total time to work with new columns is less than one hour. Update the DDL, copy the file.
That's not possible in a relational database. If the new columns are not defined in the DDL and the ETL when the files arrive, the new new data cannot be loaded into the database. If IT is not aware of the new columns, data could be lost or an entire reload of the data set is required, taking even more time.
So, yeah, in some cases Hive makes sense for "small data". The total time is much lower and the system is much easier to maintain going forward.
As a bonus, Hadoop and the HDFS storage give you built-in data redundancy and built-in failover. Try getting that with Oracle or SQLServer out of the box and free.