Apache Hive Or Cloudera Impala? What is Best for me?

July 10, 2013 Sandip Shinde

Hi Guys,

I’m not recommending either Hive or Cloudera Impala here; I just want to share my experience with both tools, as I have noticed a lot of confusion these days about which one is best. A direct comparison between Hive and Impala is not really fair. Simply put, it depends entirely on your use case and on how fast you need to read data from Hadoop.

Hive, in the Hadoop ecosystem, is a data warehouse system meant to make data aggregation and ad hoc queries easy over large datasets stored in HDFS, whereas Cloudera Impala is a query engine for data stored in HDFS and HBase. Both support the existing HiveQL (a SQL-like language).

We can access all objects in the Hive data warehouse with HiveQL, which uses MapReduce in the background for data retrieval and transformation, and this results in latency. The main intention behind Hive was never low latency; it was developed for batch processing. The best use case for Hive is long-running jobs that perform data-heavy operations such as joins on very large datasets. Hive is very good for file processing but is not recommended for handling live updates when data is streaming into HDFS. To achieve that, we have to change our architecture and integrate Hive with HBase. I’ll explain Hive and HBase integration for handling streaming data in upcoming articles.

Cloudera Impala doesn’t use MapReduce like Hive does, and it doesn’t write intermediate results to disk. It reads data from HDFS or HBase and has its own execution architecture for transforming data. Impala uses a distributed query engine that builds the query plan and distributes it across the cluster, and every node reads its data from HDFS or HBase locally. Impala is focused on traditional enterprise customers and on OLAP and data warehouse workloads.
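
The scatter-gather pattern described above can be illustrated with a toy sketch in Python. It is not Impala's actual engine or API; the node names, the `local_blocks` layout, and the function names are hypothetical. The point is only that each node aggregates the rows it holds locally and a coordinator merges the small partial results in memory, with no intermediate files.

```python
from collections import Counter

# Hypothetical cluster layout: each node holds some rows of a table locally.
# Rows are (country, amount) pairs; the data is made up for illustration.
local_blocks = {
    "node1": [("IN", 10), ("US", 5)],
    "node2": [("IN", 7), ("UK", 3)],
    "node3": [("US", 1), ("UK", 4)],
}

def run_fragment(rows):
    """Plan fragment run on one node: partial SUM(amount) GROUP BY country."""
    partial = Counter()
    for country, amount in rows:
        partial[country] += amount
    return partial

def coordinate(blocks):
    """Coordinator: run the fragment on every node and merge the partials."""
    total = Counter()
    for rows in blocks.values():
        # Counter.update adds the partial sums to the running totals.
        total.update(run_fragment(rows))
    return dict(total)

result = coordinate(local_blocks)
print(result)
```

In a real cluster the fragments would run in parallel on separate machines; the loop here only stands in for that.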

Because of this, a node failure during a batch-oriented or analytical query in Impala means the whole query has to be re-issued. So Hive still has its place for ETL-style jobs, where re-issuing a long-running query would be a costly affair for Impala. Impala, in turn, has a big advantage for queries whose runtime is short enough that a node failure during the query is unlikely.
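
A rough back-of-the-envelope model makes this trade-off concrete. The sketch below assumes node failures are independent and arrive at a constant rate (an exponential model), and the failure rate, node count, and runtimes are made-up numbers, not measurements. Under those assumptions it estimates the chance that a query finishes without any node failing.

```python
import math

def p_no_failure(runtime_hours, nodes, failures_per_node_hour):
    """Probability that no node fails while the query runs.

    Assumes independent failures at a constant rate (exponential model).
    """
    return math.exp(-failures_per_node_hour * nodes * runtime_hours)

def expected_attempts(runtime_hours, nodes, failures_per_node_hour):
    """Expected number of runs if any failure forces a full re-issue.

    Each attempt succeeds with probability p, so the count is geometric: 1/p.
    """
    return 1.0 / p_no_failure(runtime_hours, nodes, failures_per_node_hour)

# Hypothetical cluster: 100 nodes, one failure per 10,000 node-hours.
NODES = 100
RATE = 1e-4

short_query = p_no_failure(30 / 3600, NODES, RATE)  # 30-second query
long_etl = p_no_failure(10, NODES, RATE)            # 10-hour ETL job

print(f"30-second query survives: {short_query:.5f}")
print(f"10-hour job survives:     {long_etl:.3f}")
print(f"expected runs of the 10-hour job: {expected_attempts(10, NODES, RATE):.2f}")
```

With these made-up numbers the short query almost never sees a failure, while the 10-hour job survives only about 90% of the time, which is where restarting from scratch starts to hurt.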

The two projects have overlapping goals, but there are substantial differences between them. Interestingly, both also have major optimizations coming in the next 6 months. Still, for your reference, here is a list of features that Hive supports and Cloudera Impala does not as of now, along with a few current Impala limitations:

UDFs (user-defined functions) are not supported, because Impala uses a custom C++ runtime
DDL statements are not supported
XML functions are not supported
JSON functions are not supported
Custom serialization and deserialization are not supported; as of now Impala can only read text files from HDFS, not custom binary files
Table metadata needs to be refreshed when new files are added to a table's data directory in HDFS (this is not a requirement in Hive)
Memory use is very high, and Impala does not yet provide in-memory storage
Queries are not fault tolerant
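
For the metadata item in the list above, the refresh is an explicit statement issued to Impala after new files land in a table's directory. The helper below only builds the statement text and does not talk to a cluster. It is a sketch: the function name and the table name `web_logs` are hypothetical, and the exact syntax (`REFRESH table_name`, or `INVALIDATE METADATA`) depends on the Impala version, so check the documentation for the release you run.

```python
import re

# Plain SQL identifiers only, so the table name cannot smuggle in extra SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def refresh_statement(table, database=None):
    """Build a REFRESH statement for one table (hypothetical helper)."""
    parts = [database, table] if database else [table]
    for part in parts:
        if not _IDENT.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return "REFRESH " + ".".join(parts)

stmt = refresh_statement("web_logs", database="default")
print(stmt)
```

A loading job would send this statement through whatever Impala client it already uses, right after copying the new files into HDFS.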

Both systems integrate out of the box with many Business Intelligence tools, since they support ODBC and JDBC drivers, and this has been a major goal for Impala.

Let’s wait for the next version of Cloudera Impala and see what it brings for our BigData project implementations. Meanwhile, on the Hive side, we have Shark, which extends Apache Hive to dramatically speed up both in-memory and on-disk queries, claiming to be 100x faster than Hive with in-memory data and 5-10x faster with on-disk data, depending on the queries. Its long-term goal is a unified system that supports both SQL and advanced analytics (machine learning, statistics, etc.).

Really, I do not want to be unfair to either project here; I just wanted to share my experience with you all. Wait and watch is a good option for now; meanwhile, I will keep using Shark on top of Hive.

Stay tuned for new articles from BigData arena …

Thanks

Sandip

References:

http://hive.apache.org/
http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html

Comments (5)

Nayan Naik

July 11, 2013 at 6:27 am

A very informative article !! Thanks !!
Norbert Gergely

July 29, 2013 at 3:25 pm

I would be really keen to see your article about “Hive and HBase integration to handle streaming data” as promised earlier. I think this is what is my major focus nowadays.
Kishore Veleti

October 30, 2013 at 12:50 am

Very nice article
- Sandip Shinde
  
  November 7, 2013 at 11:01 pm
  
  Thanks Kishore..:)
ps

October 8, 2015 at 11:24 pm

Very Informative.