Apache Hive Or Cloudera Impala? What is Best for me?

Hi Guys,

I’m not recommending either Hive or Cloudera Impala here; I just want to share my experience with both tools, as I’ve noticed a lot of confusion these days about which one is best. A head-to-head comparison between Hive and Impala is not really fair. Simply put, it depends entirely on your use case and on how fast you need to read data from Hadoop.

Hive, in the Hadoop ecosystem, is a data warehouse system meant to make data aggregation and ad hoc queries easy over large datasets stored in HDFS, whereas Cloudera Impala is a query engine for data stored in HDFS and HBase. Both support the existing HiveQL (a SQL-like language).

We can access all objects in the Hive data warehouse with HiveQL, which uses MapReduce in the background for data retrieval and transformation, and this results in latency. The main intention behind Hive was never low latency; it was developed for batch processing. The best use case for Hive is long-running jobs that perform data-heavy operations such as joins on very large datasets. Hive is very good for file processing but is not recommended for handling live updates when data is streaming into HDFS. To achieve that, we have to change the architecture and integrate Hive with HBase. I’ll explain Hive and HBase integration for handling streaming data in upcoming articles.
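
To make the latency point concrete, here is a minimal pure-Python sketch of how a HiveQL aggregation such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be broken into map, shuffle and reduce phases. The table, column and function names are made up for illustration, and this is not Hive's actual execution code. It only shows that each phase has to finish and hand over its intermediate output before the next one starts, which is one reason this style of processing suits batch jobs better than quick lookups.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [("asha", "sales"), ("ben", "ops"), ("carl", "sales"),
        ("dina", "ops"), ("eli", "hr")]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, like a mapper for COUNT(*) ... GROUP BY dept."""
    return [(dept, 1) for _name, dept in rows]


def shuffle_phase(pairs):
    """Group intermediate pairs by key; a real cluster does this via disk and network."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the values for each key to get the final count per dept."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    # Each phase consumes the complete output of the previous one.
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```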

Cloudera Impala doesn’t use MapReduce like Hive does, and it doesn’t write intermediate results to disk. It reads data from HDFS or HBase and has its own architecture for transforming data. Impala uses a distributed query engine to build the query plan and distribute it across the cluster, and every node reads data from HDFS or HBase locally. Impala is focused on traditional enterprise customers and on OLAP and data warehouse workloads.
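
The idea of distributing a plan and reading data locally can be sketched roughly as follows. This is only an illustration in plain Python under my own assumptions: the node names, the data and the functions run_fragment and coordinator are hypothetical and are not part of Impala. Each node aggregates the rows it stores locally, and the partial results are merged in memory without an intermediate write to disk.

```python
from collections import Counter

# Hypothetical data blocks, keyed by the node that stores them locally.
LOCAL_BLOCKS = {
    "node1": ["sales", "ops", "sales"],
    "node2": ["ops", "hr"],
    "node3": ["sales"],
}


def run_fragment(local_rows):
    """Plan fragment run on one node: aggregate only the locally stored rows."""
    return Counter(local_rows)


def coordinator(blocks):
    """Run the same fragment for every node and merge the partial counts in memory."""
    total = Counter()
    for _node, rows in blocks.items():
        total.update(run_fragment(rows))
    return dict(total)


if __name__ == "__main__":
    print(coordinator(LOCAL_BLOCKS))
```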

Based on the above, with Cloudera Impala a node failure during a batch-oriented or analytical query means the query has to be re-issued. So Hive still has its place for ETL-style jobs, where re-issuing a long query would be a costly affair for Impala. Impala, in turn, has a big advantage for queries whose runtime is short enough that a node failure during the query is unlikely.
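
A rough back-of-the-envelope model shows why query runtime matters so much here. The sketch below assumes independent node failures that follow an exponential distribution, and the cluster size and mean time between failures (MTBF) in the usage note are invented numbers, not measurements from any real cluster.

```python
import math


def prob_no_failure(runtime_hours, nodes, node_mtbf_hours):
    """Probability that none of the nodes fails while the query runs.

    Assumes independent, exponentially distributed node failures.
    """
    return math.exp(-nodes * runtime_hours / node_mtbf_hours)


def expected_attempts(runtime_hours, nodes, node_mtbf_hours):
    """Expected number of runs if any failure forces a full restart of the query."""
    return 1.0 / prob_no_failure(runtime_hours, nodes, node_mtbf_hours)


if __name__ == "__main__":
    # Assumed cluster: 100 nodes, each with an MTBF of 10,000 hours.
    for label, hours in [("1-minute query", 1 / 60), ("10-hour ETL job", 10)]:
        p = prob_no_failure(hours, nodes=100, node_mtbf_hours=10_000)
        print(f"{label}: P(no node failure) = {p:.4f}")
```

Under these assumed numbers the one-minute query almost never sees a failure, while the ten-hour job finishes without any node failing only about 90% of the time, so restarting from scratch starts to hurt.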

Both projects share overlapping goals, but there are substantial differences. Interestingly, both projects also have major optimizations coming in the next 6 months. Still, for your reference, here is a list of features and syntax that Hive supports and Cloudera Impala does not as of now, along with a few current Impala limitations:

  • no support for UDFs (user-defined functions), because Impala uses a custom C++ runtime
  • no support for DDL statements
  • no XML functions
  • no JSON functions
  • no custom serialization and deserialization; Impala can only read text files from HDFS, not custom binary files, as of now
  • table metadata needs to be refreshed when new files are added to a data directory in HDFS (this is not a requirement in Hive)
  • it is very memory intensive and does not yet provide in-memory storage
  • no fault tolerance

Both systems integrate out of the box with many Business Intelligence tools, since both support ODBC and JDBC drivers; this has been a major goal for Impala.

Let’s wait for the next version of Cloudera Impala and see what it offers for our Big Data project implementations. Meanwhile, on the Hive side, we have Shark, which extends Apache Hive to dramatically speed up both in-memory and on-disk queries, claiming to be 100x faster than Hive with in-memory data and 5-10x faster with on-disk data, depending on the queries. Its long-term goal is a unified system that supports both SQL and advanced analytics (machine learning, statistics, etc.) for Hive.

Actually, I do not want to be unfair to either project here; I just wanted to share my experience with you all. Wait and watch is a good option for now; meanwhile, I will keep using Shark on top of Hive.

Stay tuned for new articles from BigData arena …

Thanks

Sandip

References:

http://hive.apache.org/
http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html
  1. July 11, 2013 at 6:27 am

    A very informative article !! Thanks !!

  2. Norbert Gergely
    July 29, 2013 at 3:25 pm

    I would be really keen to see your article about “Hive and HBase integration to handle streaming data” as promised earlier. I think this is what is my major focus nowadays.

  3. October 30, 2013 at 12:50 am

    Very nice article

  4. ps
    October 8, 2015 at 11:24 pm

    Very Informative.
