OK
CREATE EXTERNAL TABLE `bigdata_user`(
  `id` int,
  `uid` string,
  `item_id` string,
  `behavior_type` int,
  `item_category` string,
  `visit_date` date,
  `province` string)
COMMENT 'Welcome to xmu dblab!'
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/bigdatacase/dataset'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1711268943')
Time taken: 0.124 seconds, Fetched: 23 row(s)
5. Run the following command to view the table's structure:
desc bigdata_user;
The result is as follows:
OK
id             	int
uid            	string
item_id        	string
behavior_type  	int
item_category  	string
visit_date     	date
province       	string
Time taken: 0.041 seconds, Fetched: 7 row(s)
2. Simple query analysis
1. First, test a simple statement:
# View the item behavior of the first 10 user records
select behavior_type from bigdata_user limit 10;
The result is as follows:
OK
1
1
1
1
1
4
1
1
1
1
Time taken: 0.929 seconds, Fetched: 10 row(s)
2. To retrieve several pieces of information about each user's purchases, the statement takes the form select col1, col2, …, coln from table_name;. For example, query the visit date and item category for the first 20 records:
select visit_date,item_category from bigdata_user limit 20;
(The 20-row result is omitted here.)

3. Row-count statistics analysis

1. Use the aggregate function count() to count the rows in the table:

select count(*) from bigdata_user;

The result is as follows:

Automatically selecting local only mode for query
Query ID = muyoukule_20240324015155_7ae1692b-c09d-4234-9d4d-3404a61fb05c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2024-03-24 01:51:57,388 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local2145470938_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 31204930 HDFS Write: 81922062 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
300000
Time taken: 1.461 seconds, Fetched: 1 row(s)
As shown, the result is the number under OK, 300000 (our small_user.csv contains 300,000 records, all of which were imported into Hive).
2. Add distinct inside the function to count how many distinct uid values there are.
Continue with the following:
# Add distinct inside count() to count distinct uid values
select count(distinct uid) from bigdata_user;
Automatically selecting local only mode for query
Query ID = muyoukule_20240324015507_77989e78-51e4-45c1-9eb1-200c0787c3ec
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 01:55:09,012 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1035121157_0002
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 62385496 HDFS Write: 81922271 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
270
Time taken: 1.346 seconds, Fetched: 1 row(s)
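As a side note, the same distinct count can be sketched as a two-stage query against the same table: deduplicate in a subquery, then count the deduplicated rows. This form is a common workaround when a single count(distinct …) funnels all data through one reducer (shown as a sketch, not part of the original lab):

```sql
-- Deduplicate uid first, then count the deduplicated rows;
-- equivalent to: select count(distinct uid) from bigdata_user;
select count(*) from (
  select uid from bigdata_user group by uid
) t;  -- Hive requires an alias (t) on the derived table
```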
3. Count how many fully distinct rows there are (to filter out possible fake orders):
select count(*) from (
  select uid, item_id, behavior_type, item_category, visit_date, province
  from bigdata_user
  group by uid, item_id, behavior_type, item_category, visit_date, province
  having count(*) = 1
) a;
Automatically selecting local only mode for query
Query ID = muyoukule_20240324015553_0c5051e4-2030-4be3-b27a-f981274e056d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
Job running in-process (local Hadoop)
2024-03-24 01:55:55,036 Stage-1 map = 0%, reduce = 0%
2024-03-24 01:55:57,044 Stage-1 map = 100%, reduce = 0%
2024-03-24 01:55:58,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1915786213_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Selecting local mode for task: Stage-2
Job running in-process (local Hadoop)
2024-03-24 01:56:00,021 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local1303362696_0004
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 93566056 HDFS Write: 81927009 SUCCESS
Stage-Stage-2:  HDFS Read: 93584837 HDFS Write: 81944726 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
284047
Time taken: 6.41 seconds, Fetched: 1 row(s)
As shown, after removing duplicates only 284047 records remain.
PS: a nested subquery should always be given an alias (the a above); otherwise Hive raises an error.
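To illustrate the alias rule with a minimal sketch (the exact error message depends on the Hive version):

```sql
-- Fails to parse: the derived table has no alias
-- select count(*) from (select uid from bigdata_user group by uid);

-- Works: alias the derived table, even with a single letter
select count(*) from (select uid from bigdata_user group by uid) a;
```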
4. Keyword-based conditional query analysis
1. Queries conditioned on a key column falling within a range
Using where narrows the scope of a query and makes it more precise. Let's test it with some examples.
(1) Count how many users browsed items between December 10 and December 13, 2014 (both bounds exclusive, i.e. December 11-12):
select count(*) from bigdata_user where behavior_type='1' and visit_date<'2014-12-13' and visit_date>'2014-12-10';
Automatically selecting local only mode for query
Query ID = muyoukule_20240324015825_ec7ac2f7-57a2-48cb-bda7-729a69bf164c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 01:58:27,122 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local803845934_0007
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 155963028 HDFS Write: 81958359 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
26329
Time taken: 1.395 seconds, Fetched: 1 row(s)
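The same date window can also be written with between. Note that between is inclusive on both ends, so to match the exclusive bounds above the endpoints shift inward (a sketch against the same table):

```sql
-- Inclusive form of the window 2014-12-11 .. 2014-12-12
select count(*) from bigdata_user
where behavior_type='1'
  and visit_date between '2014-12-11' and '2014-12-12';
```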
(2) Using the day of the month as the unit, show for each day n how many distinct users made a purchase on the site:
select count(distinct uid), day(visit_date) from bigdata_user where behavior_type='4' group by day(visit_date);
Automatically selecting local only mode for query
Query ID = muyoukule_20240324015901_e5a2dc9f-ba6e-4c3b-918e-47ad98ff9bb7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
Job running in-process (local Hadoop)
2024-03-24 01:59:02,325 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1221129954_0008
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 187143592 HDFS Write: 81959082 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
37	1
48	2
42	3
38	4
42	5
33	6
42	7
36	8
34	9
40	10
43	11
98	12
39	13
43	14
42	15
44	16
42	17
66	18
38	19
50	20
33	21
34	22
32	23
47	24
34	25
31	26
30	27
34	28
39	29
38	30
Time taken: 1.31 seconds, Fetched: 30 row(s)
2. Queries that fix a key column to a given value and analyze the remaining data
Given a date and a province, count the purchases shipped to that province on that day:
select count(*) from bigdata_user where province='江西' and visit_date='2014-12-12' and behavior_type='4';
Automatically selecting local only mode for query
Query ID = muyoukule_20240324015952_f881672f-70ca-4303-b128-dd3f8c552945
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 01:59:53,388 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local479466284_0009
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 218325182 HDFS Write: 81959801 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
8
Time taken: 1.264 seconds, Fetched: 1 row(s)
5. Analysis based on user behavior
The following queries follow the same pattern as above; each statement is shown together with its output.
1. Query the purchase or browse ratio for the site on a given day
# Count how many users made a purchase on 2014-12-11
select count(*) from bigdata_user where visit_date='2014-12-11' and behavior_type='4';
Automatically selecting local only mode for query
Query ID = muyoukule_20240324020118_47a9bc8d-2397-42c8-a4ec-8d4ffc46674c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 02:01:20,097 Stage-1 map = 100%, reduce = 0%
2024-03-24 02:01:21,099 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1882990576_0010
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 249505738 HDFS Write: 81960004 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
69
Time taken: 2.335 seconds, Fetched: 1 row(s)
# Count how many users clicked on the shop on 2014-12-11
select count(*) from bigdata_user where visit_date='2014-12-11';
Automatically selecting local only mode for query
Query ID = muyoukule_20240324020155_4f1a4baf-b338-4ab9-8d05-a0064474caa7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 02:01:56,329 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1753478999_0011
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 280686296 HDFS Write: 81960211 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
10649
Time taken: 1.313 seconds, Fetched: 1 row(s)
2. Query the purchase ratio for the site on 2014-12-12:

# Count how many users made a purchase on 2014-12-12
select count(*) from bigdata_user where visit_date='2014-12-12' and behavior_type='4';

Automatically selecting local only mode for query
Query ID = muyoukule_20240324020405_c0631a79-6d8d-4029-812c-6592d727a478
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 02:04:06,828 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local877196646_0012
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 311866860 HDFS Write: 81960418 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
69
Time taken: 1.307 seconds, Fetched: 1 row(s)
# Count all users' clicks on the site that day
select count(*) from bigdata_user where visit_date='2014-12-12';
Automatically selecting local only mode for query
Query ID = muyoukule_20240324020425_b2082f31-ead2-4901-bd90-e24610fc4891
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
2024-03-24 02:04:26,393 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local783450845_0013
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 343047418 HDFS Write: 81960625 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
17494
Time taken: 1.332 seconds, Fetched: 1 row(s)
Dividing the result of the first statement by the second gives the required ratio.
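The two counts and the ratio itself can also be computed in a single pass with conditional aggregation, a sketch assuming the same table and date:

```sql
-- Purchases, total clicks, and their ratio for 2014-12-12 in one query
select
  sum(case when behavior_type='4' then 1 else 0 end)           as purchases,
  count(*)                                                     as clicks,
  sum(case when behavior_type='4' then 1 else 0 end) / count(*) as ratio
from bigdata_user
where visit_date='2014-12-12';
```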
3. Given a purchase-count threshold, query the ids of users who purchased that many items on the site on a given day
# Query the ids of users who made more than 5 purchases on 2014-12-12
select uid from bigdata_user where behavior_type='4' and visit_date='2014-12-12' group by uid having count(*)>5;
Automatically selecting local only mode for query
Query ID = muyoukule_20240324020454_4841d58d-313e-4319-87b3-01d8836606ca
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
Job running in-process (local Hadoop)
2024-03-24 02:04:55,645 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1954605595_0014
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 374227982 HDFS Write: 81961208 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
100226515
100300684
100555417
100605
10095384
10142625
101490976
101982646
102011320
102030700
102079825
102349447
102612580
102650143
103082347
103139791
103794013
103995979
Time taken: 1.298 seconds, Fetched: 18 row(s)
6. Real-time user query analysis
Count how many times users in each province browsed the site:
# Create a new table to store the results
create table scan(province STRING, scan INT)
COMMENT 'This is the search of bigdataday'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

# Populate it with the browse counts per province
insert overwrite table scan
select province, count(behavior_type) from bigdata_user
where behavior_type='1' group by province;
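Once the insert finishes, the results can be read back from the new table, for example sorted by view count (a sketch; scan is the table created above, and both the table and its count column share that name):

```sql
-- Provinces with the most browse events first
select province, scan from scan order by scan desc limit 10;
```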