Hive in Practice: A Comprehensive Case Study

    2 Checking the HDFS space used

    Found 1 items
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:19 /user/hive/warehouse/u_data/u.data
    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722102853_77aa1bc6-79c2-4916-9b07-a763d112ef41
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0003, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0003/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0003
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 10:28:58,786 Stage-1 map = 0%, reduce = 0%
    2016-07-22 10:29:03,890 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
    2016-07-22 10:29:10,005 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.71 sec
    MapReduce Total cumulative CPU time: 1 seconds 710 msec
    Ended Job = job_1468978056881_0003
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.71 sec HDFS Read: 1987050 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 710 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +---------+--+
    | c0 |
    +---------+--+
    | 100000 |
    +---------+--+
    1 row selected (17.757 seconds)

    Hive's MapReduce engine really cannot compete on speed here: counting 100,000 rows took 17 seconds, noticeably worse than a relational database. For this kind of workload, Spark is still the better choice.
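The "In order to ..." hint lines in the log describe how Hive sizes the reduce phase from the input volume. As a rough sketch of that sizing rule (`estimate_reducers` is a helper named here for illustration; the exact default values of these properties vary across Hive versions):

```python
import math

def estimate_reducers(input_bytes: int,
                      bytes_per_reducer: int,
                      max_reducers: int) -> int:
    """Approximate Hive's reducer estimate: input size divided by
    hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max.
    Setting mapreduce.job.reduces to a positive value bypasses the
    estimate entirely and forces a fixed reducer count."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))

# The ~2 MB u.data input fits well under a single reducer's share,
# which matches "number of reducers: 1" in the log above.
print(estimate_reducers(1_979_173, 256 * 1024 * 1024, 1009))
```

This is why tiny inputs always get one reducer no matter how many nodes the cluster has; only a large scan, or an explicit `set mapreduce.job.reduces`, raises the count.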

    Next we need a way to run one-off Hive commands from the shell. hive -e does exactly that, so we can script repeated loads like this:

    #!/bin/bash
    # Load the same local file into u_data ten times; each LOAD DATA adds
    # another copy (u_copy_N.data) under the table's warehouse directory.
    # hive -e runs in the foreground, so no extra synchronization is needed.
    for (( c=1; c<=10; c++ ))
    do
        echo "Loading batch $c ..."
        hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"
    done

    Once the loads have finished, check the query cost again:

    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from u_data;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722104633_18c3467d-9263-4785-8714-1570fc3bb9ae
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0009, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0009/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0009
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 10:46:39,037 Stage-1 map = 0%, reduce = 0%
    2016-07-22 10:46:46,190 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
    MapReduce Total cumulative CPU time: 2 seconds 670 msec
    Ended Job = job_1468978056881_0009
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.67 sec HDFS Read: 77198770 HDFS Write: 107 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 670 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +----------+--+
    | c0 |
    +----------+--+
    | 3900000 |
    +----------+--+
    1 row selected (20.173 seconds)

    This run took 20 seconds, so MapReduce's job-startup overhead clearly dominates the cost. The warehouse directory now holds 39 copies of the file:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/u_data
    Found 39 items
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:37 /user/hive/warehouse/u_data/u.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:38 /user/hive/warehouse/u_data/u_copy_1.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_10.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_11.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:41 /user/hive/warehouse/u_data/u_copy_12.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_13.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_14.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_15.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:42 /user/hive/warehouse/u_data/u_copy_16.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_17.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_18.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_19.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_2.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_20.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_21.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_22.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:43 /user/hive/warehouse/u_data/u_copy_23.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_24.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_25.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_26.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_27.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_28.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:44 /user/hive/warehouse/u_data/u_copy_29.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_3.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_30.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_31.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_32.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_33.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_34.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_35.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:45 /user/hive/warehouse/u_data/u_copy_36.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_37.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:46 /user/hive/warehouse/u_data/u_copy_38.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_4.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_5.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_6.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_7.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:39 /user/hive/warehouse/u_data/u_copy_8.data
    -rwxrwxr-x 2 hadoop supergroup 1979173 2016-07-22 10:40 /user/hive/warehouse/u_data/u_copy_9.data
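A quick arithmetic sanity check on the count above (each loaded file is an identical 100,000-row copy of u.data, as the earlier count showed):

```python
# 1 original file plus 38 copies, each holding the same 100,000 ratings.
files = 39
rows_per_file = 100_000
total = files * rows_per_file
print(total)  # matches the 3,900,000 returned by count(*)
```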

    1 Creating an external table and loading data

    hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/u.data' INTO TABLE u_data;"

    3 Managed (internal) tables vs. external tables

    After dropping the two tables created earlier (one managed, one external) with DROP TABLE, this is what remains:

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/
    Found 5 items
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 17:25 /user/hive/warehouse/employees
    drwxrwxr-x - hadoop supergroup 0 2016-07-21 15:52 /user/hive/warehouse/fincials.db
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 09:50 /user/hive/warehouse/t_hive
    drwxrwxr-x - hadoop supergroup 0 2016-07-20 09:54 /user/hive/warehouse/t_hive2
    drwxrwxr-x - hadoop supergroup 0 2016-07-22 11:04 /user/hive/warehouse/u_data_external_table

    The managed table's data is gone entirely, while the external table's data is still there.
    • When data is loaded into an external table, it is not moved into Hive's own warehouse directory; in other words, an external table does not manage its own data, whereas a managed table does.
    • When a managed table is dropped, Hive deletes both its metadata and its data; when an external table is dropped, Hive deletes only the metadata and leaves the data untouched.
      So which kind of table should you use? In most situations there is little practical difference, and the choice comes down to preference. As a rule of thumb, though: if all processing will be done by Hive, create a managed table; otherwise, use an external table.
    The session below overwrites the bucketed table bucketed_data_user from the partitioned table, counts its rows, and then lists the resulting bucket files on HDFS:

    OK
    No rows affected (0.045 seconds)
    0: jdbc:hive2://hadoopmaster:10000/>
    0: jdbc:hive2://hadoopmaster:10000/> insert overwrite table bucketed_data_user select userid,movieid,rating,unixtime from u_data_partitioned_table;
    Query ID = hadoop_20160722140142_c272bc07-b74d-4b5b-9689-0bec2ce71780
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 4
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0010/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0010
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
    2016-07-22 14:01:48,774 Stage-1 map = 0%, reduce = 0%
    2016-07-22 14:01:55,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
    2016-07-22 14:02:06,236 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 5.66 sec
    2016-07-22 14:02:07,272 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.43 sec
    MapReduce Total cumulative CPU time: 9 seconds 430 msec
    Ended Job = job_1468978056881_0010
    Loading data to table default.bucketed_data_user
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 9.43 sec HDFS Read: 5959693 HDFS Write: 5937879 SUCCESS
    Total MapReduce CPU Time Spent: 9 seconds 430 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    No rows affected (26.251 seconds)
    0: jdbc:hive2://hadoopmaster:10000/>
    0: jdbc:hive2://hadoopmaster:10000/> select count(*) from bucketed_data_user ;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    Query ID = hadoop_20160722141056_eaf582be-4107-403a-bacd-0a18f567f576
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1468978056881_0012, Tracking URL = http://hadoopmaster:8088/proxy/application_1468978056881_0012/
    Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1468978056881_0012
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-07-22 14:11:04,156 Stage-1 map = 0%, reduce = 0%
    2016-07-22 14:11:09,331 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
    2016-07-22 14:11:15,488 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.78 sec
    MapReduce Total cumulative CPU time: 1 seconds 780 msec
    Ended Job = job_1468978056881_0012
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.78 sec HDFS Read: 5945855 HDFS Write: 106 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 780 msec
    OK
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
    +---------+--+
    | c0 |
    +---------+--+
    | 300000 |
    +---------+--+
    1 row selected (20.397 seconds)
    0: jdbc:hive2://hadoopmaster:10000/>

    hadoop@hadoopmaster:~$ hdfs dfs -ls /user/hive/warehouse/bucketed_data_user
    Found 4 items
    -rwxrwxr-x 2 hadoop supergroup 1400994 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000000_0
    -rwxrwxr-x 2 hadoop supergroup 1493856 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000001_0
    -rwxrwxr-x 2 hadoop supergroup 1475931 2016-07-22 14:02 /user/hive/warehouse/bucketed_data_user/000003_0
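Each of the four reduce tasks wrote one bucket file (000000_0, 000001_0, and so on). Assuming the table was declared CLUSTERED BY (userid) INTO 4 BUCKETS (the DDL is not shown above, so this is an assumption), a minimal Python sketch of how rows land in bucket files: Hive's default hash for integer columns is the value itself, so the bucket index is simply the value modulo the bucket count.

```python
def bucket_index(userid: int, num_buckets: int = 4) -> int:
    # Assumed bucketing column: userid (hypothetical, DDL not shown).
    # For Hive int columns the default hash is the value itself, so a row
    # lands in bucket (userid % num_buckets); each bucket becomes one
    # reducer output file named 000000_0, 000001_0, ...
    return userid % num_buckets

# Distribute a few sample user ids from u.data across the 4 bucket files.
sample_ids = [196, 186, 22, 244, 166, 298]
by_file = {}
for uid in sample_ids:
    fname = f"{bucket_index(uid):06d}_0"
    by_file.setdefault(fname, []).append(uid)
print(by_file)
```

This also explains why the bucket files differ slightly in size: the hash spreads user ids across buckets roughly, not perfectly, evenly.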