Hive: Running Simple Queries with a Fetch Task Instead of a MapReduce Job

Original post
小哥, 2022-11-02

I. Background:

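Even when a query reads just one column of a table, Hive by default launches a MapReduce job to do the work, and starting a MapReduce job carries noticeable overhead. To address this, starting with Hive 0.10.0, simple queries (no functions, no sorting, no aggregation) such as SELECT <col> FROM <table> LIMIT n can skip MapReduce: when the Fetch Task feature is enabled, such a query does not generate a MapReduce job and instead uses a Fetch Task to read the data directly from HDFS and output it, which improves efficiency.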

II. How to configure Fetch Task

1. At the hive prompt:

hive> set hive.fetch.task.conversion=more;

2. Pass the parameter when starting hive:

bin/hive --hiveconf hive.fetch.task.conversion=more

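3. Edit hive-site.xml, add the property, then save and exit.
Both methods above enable Fetch Task, but only temporarily, for the current session. To keep the feature enabled permanently, add the following configuration to ${HIVE_HOME}/conf/hive-site.xml: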

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to single FETCH task
    minimizing latency.Currently the query should be single
    sourced not having any subquery and should not have
    any aggregations or distincts (which incurrs RS),
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>
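The hive-site.xml edit can also be scripted. The following is only a minimal sketch, assuming the file uses the usual Hadoop layout of a `<configuration>` root with `<property>` children; the helper name `set_hive_property` is hypothetical and not part of Hive.

```python
import xml.etree.ElementTree as ET


def set_hive_property(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with the property `name` set to `value`.

    Hypothetical helper: updates the existing <property> if one with this
    name is present, otherwise appends a new one under the root element.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No property with this name yet: add a fresh one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<configuration></configuration>"
    print(set_hive_property(sample, "hive.fetch.task.conversion", "more"))
```

Note that ElementTree drops XML comments and the leading declaration when it re-serializes, so review the result before overwriting a real hive-site.xml.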

III. Examples:

1. Without Fetch Task configured, Hive launches a MapReduce job by default to run the query.

hive> select id from t ;                 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there is no reduce operator
Starting Job = job_1402248601715_0004, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1402248601715_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 11:12:54,817 Stage-1 map = 0%,  reduce = 0%
2014-06-09 11:13:15,790 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
2014-06-09 11:13:16,982 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
MapReduce Total cumulative CPU time: 2 seconds 960 msec
Ended Job = job_1402248601715_0004
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.96 sec   HDFS Read: 257 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 960 msec
OK
Time taken: 51.496 seconds

The log above shows that the query launched a MapReduce job with 1 mapper and no reducers.

2. Configure Fetch Task using the hive.fetch.task.conversion parameter:


<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task
    minimizing latency.Currently the query should be single
    sourced not having any subquery and should not have
    any aggregations or distincts (which incurrs RS),
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

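The default value of this parameter is minimal, which means a query is converted to a FetchTask only when it runs "select *" with a limit. If the value is more, a query that selects specific columns with a limit is also converted to a FetchTask. There are preconditions, though: the query must have a single data source, that is, its input is one table or partition; it must not contain subqueries, aggregations or distinct; and it must not use lateral views or joins.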
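To make the rules concrete, here is a small sketch that encodes them as a checklist. It is a simplified model written only from the property description quoted above, not Hive's actual optimizer logic; the `QueryShape` class and `may_use_fetch_task` function are hypothetical names, and the sketch assumes LIMIT is permitted rather than required.

```python
from dataclasses import dataclass


@dataclass
class QueryShape:
    """Hypothetical summary of a query's features (not a Hive API)."""
    select_star: bool = False
    # True when there is no filter, or the filter touches partition columns only.
    filters_on_partition_columns_only: bool = True
    single_source: bool = True
    has_subquery: bool = False
    has_aggregation_or_distinct: bool = False
    has_lateral_view: bool = False
    has_join: bool = False


def may_use_fetch_task(q: QueryShape, mode: str) -> bool:
    """Approximate the rules in the hive.fetch.task.conversion description."""
    if mode not in ("minimal", "more"):
        return False
    # Preconditions that apply to both modes.
    if (not q.single_source or q.has_subquery or q.has_aggregation_or_distinct
            or q.has_lateral_view or q.has_join):
        return False
    if mode == "more":
        # SELECT, FILTER, LIMIT are all acceptable.
        return True
    # minimal: SELECT STAR, FILTER on partition columns, LIMIT only.
    return q.select_star and q.filters_on_partition_columns_only


if __name__ == "__main__":
    select_id = QueryShape(select_star=False)  # e.g. select id from t
    print(may_use_fetch_task(select_id, "minimal"))
    print(may_use_fetch_task(select_id, "more"))
```

Under this model, a column-only query such as select id from t is rejected in minimal mode and accepted in more mode, which matches the behaviour the article describes.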

To test, first set the parameter to more, then run the query again:

hive> set hive.fetch.task.conversion=more;
hive> select id from t limit 1;           
OK
Time taken: 0.242 seconds
hive> select id from t ;                  
OK
Time taken: 0.496 seconds

Note: t is an empty table with no data.
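The two runs differ in whether the CLI output announces a MapReduce job. As a rough sketch, a captured log can be checked for the "Total MapReduce jobs = N" line shown in the first example. The function name `mapreduce_jobs_launched` is hypothetical, and the exact wording of this line is taken from the log in this article and may differ in other Hive versions.

```python
import re


def mapreduce_jobs_launched(cli_output: str) -> int:
    """Return the job count announced in Hive CLI output, or 0 if none is found.

    Assumes the "Total MapReduce jobs = N" wording seen in the log above.
    """
    match = re.search(r"Total MapReduce jobs = (\d+)", cli_output)
    return int(match.group(1)) if match else 0


if __name__ == "__main__":
    # Excerpts copied from the two runs shown in this article.
    mr_log = (
        "hive> select id from t ;\n"
        "Total MapReduce jobs = 1\n"
        "Launching Job 1 out of 1\n"
    )
    fetch_log = (
        "hive> select id from t ;\n"
        "OK\n"
        "Time taken: 0.496 seconds\n"
    )
    print(mapreduce_jobs_launched(mr_log))
    print(mapreduce_jobs_launched(fetch_log))
```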
