What is the status of Hive on Parquet

Machine environment:
4 slave nodes
10.200.187.86 cslave1: 4 cores, 3 GB
10.200.187.87 cslave2: 2 cores, 4 GB
10.200.187.88 cslave3: 2 cores, 4 GB
10.200.187.89 cslave4: 2 cores, 6 GB
Test results:
[image]
Conclusions:
1. Given enough memory and simple SQL, Impala is far more efficient than Hive: simple SQL over millions of rows finishes in a few seconds, or even under a second.
2. Impala performance depends heavily on the storage format of the data: at the millions-of-rows scale, the text and HBase formats differ by roughly a factor of ten, and at the tens-of-millions scale the Parquet and text formats differ by roughly a factor of a hundred.
3. With the current cluster configuration, Impala joins at the millions-of-rows scale perform somewhat better than Hive (3 to 4 times faster), but joins of large Impala tables at that scale fail with out-of-memory errors.
4. Impala with Parquet (columnar) storage is very efficient for queries that select only a subset of columns as projection and filter fields (see the sketch below).
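As an illustration of point 4, a minimal sketch of how the Parquet table used in the tests could be built and queried. s_trade_big is an assumed name for the text-format source table, and the status value is illustrative; the column names are the ones that appear in the error report below.

-- Build a Parquet copy of a text-format table
-- (older Impala releases use STORED AS PARQUETFILE instead of PARQUET).
CREATE TABLE s_trade_big_parquet STORED AS PARQUET AS
SELECT * FROM s_trade_big;

-- With columnar storage only the referenced columns (tid, status) are read from disk,
-- which is why selecting a few fields with a filter is so cheap.
SELECT tid, status
FROM s_trade_big_parquet
WHERE status = 'TRADE_FINISHED';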

Problem:
Officially, the new version of Impala is said to be ready for production use, but feedback from the field reports many problems, the main one being memory overflow (out-of-memory errors). The official recommendation is 128 GB of memory per Impala node.

Recommended usage scenarios:
Impala can be deployed in the production environment for operations and maintenance work; it is efficient for simple data checks. However, it has a significant memory footprint, so complex SQL such as joins of large tables is not recommended.
As the tests above show, Impala's real-time queries are a large improvement over Hive, but they still cannot match a relational database, so whether to adopt it depends on the actual needs of the business.


Additional information:
Use cases:
Hive: Complex batch query tasks and data conversion tasks.
Impala: real-time data analysis. Because UDFs are not supported, the range of problems it can address is limited; used together with Hive, it can analyze Hive result sets in real time (see the sketch below).
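A minimal sketch of that combination, assuming a hypothetical result table named trade_daily_summary produced by a Hive batch job:

-- Step 1 (Hive, batch): materialize the result set of a complex job.
--   CREATE TABLE trade_daily_summary AS
--   SELECT buyer_nick, COUNT(*) AS trades FROM s_trade_big GROUP BY buyer_nick;

-- Step 2 (impala-shell, interactive): make the new table visible to Impala, then query it.
INVALIDATE METADATA trade_daily_summary;   -- or a plain INVALIDATE METADATA / REFRESH, depending on the Impala version
SELECT buyer_nick, trades
FROM trade_daily_summary
ORDER BY trades DESC
LIMIT 10;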
Advantages:
Supports fast SQL queries over big data.
Can query data in place, reducing data loading and conversion.
Several storage formats can be selected (Parquet, text, Avro, RCFile, SequenceFile); see the storage-format sketch after the disadvantages list.
Can be used in conjunction with Hive.
Disadvantages:
Custom UDFs are not supported.
Full-text search over text fields is not supported.
Transform is not supported.
No fault tolerance during query execution.
High memory requirements.
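A hedged sketch of the storage-format choice mentioned under the advantages. Table names and columns are illustrative; exact keywords and write support vary by Impala version, and Avro tables are typically created from Hive.

CREATE TABLE trades_text    (tid BIGINT, status STRING) STORED AS TEXTFILE;
CREATE TABLE trades_seq     (tid BIGINT, status STRING) STORED AS SEQUENCEFILE;
CREATE TABLE trades_rc      (tid BIGINT, status STRING) STORED AS RCFILE;
CREATE TABLE trades_parquet (tid BIGINT, status STRING) STORED AS PARQUET;
-- Older Impala releases can usually read SequenceFile/RCFile/Avro tables but only write text and Parquet.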
Supplement:
Error report:

create table testjoinparquet as
> select a.tid, a.buyer_nick, b.status, b.adjust_fee, b.buyer_email
> from (select tid, buyer_nick from s_trade_big_parquet) a
> join
> (select tid, status, adjust_fee, buyer_email from s_trade_big_parquet) b
> on (a.tid = b.tid);
Query: create table testjoinparquet as select a.tid, a.buyer_nick, b.status, b.adjust_fee, b.buyer_email from (select tid, buyer_nick from s_trade_big_parquet) a join (select tid, status, adjust_fee, buyer_email from s_trade_big_parquet) b on (a.tid = b.tid)
Query aborted.
ERRORS ENCOUNTERED DURING EXECUTION:
Backend 3: Memory Limit Exceeded
Query Limit: Consumption=1.35 GB
  Fragment dd496e82ab98ee40:19f71d48047534a2: Consumption=16.00 KB
    UDFs: Consumption=0.00
    EXCHANGE_NODE (id=4): Consumption=0.00
    DataStreamMgr: Consumption=0.00
    HdfsTableSink: Consumption=0.00
  Fragment dd496e82ab98ee40:19f71d48047534a6: Consumption=1.27 GB
    UDFs: Consumption=0.00
    HASH_JOIN_NODE (id=2): Consumption=1.07 GB
    HDFS_SCAN_NODE (id=0): Consumption=207.88 MB
    EXCHANGE_NODE (id=3): Consumption=1.70 MB
    DataStreamMgr: Consumption=1.70 MB
    DataStreamSender: Consumption=2.45 KB
  Fragment dd496e82ab98ee40:19f71d48047534aa: Consumption=82.39 MB
    UDFs: Consumption=0.00
    HDFS_SCAN_NODE (id=1): Consumption=82.19 MB
    DataStreamSender: Consumption=8.00 KB
Note: Memory was exhausted, causing some nodes to fail while reading. Writing a table in Parquet format is very memory intensive, because Parquet writes data in 1 GB blocks and each 1 GB block must be held in memory before it is written out.
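A hedged sketch of query options that can reduce this pressure on small-memory nodes; the option names are Impala query options, but their availability and accepted value formats depend on the Impala version.

-- Write smaller Parquet blocks so less data is buffered per writer (256 MB instead of ~1 GB).
SET PARQUET_FILE_SIZE=268435456;
-- Cap per-node memory for the query instead of letting it exhaust the node.
SET MEM_LIMIT=2147483648;
-- Table and column statistics help the planner pick a cheaper join strategy.
COMPUTE STATS s_trade_big_parquet;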

Associated documents:
Open Source China: http://my.oschina.net/weiqingbin/blog/196143#OSC_h2_31, which also links to many translated documents on tuning and applying the Impala framework.
For a comparison of Hive and Impala (unsupported features), see http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_unsupported.html#langref_unsupported
Chinese translation: http://my.oschina.net/weiqingbin/blog/189414

Reprinted from: https://my.oschina.net/sniperLi/blog/688768