6-Apache Hive SQL-DDL、DML

hadoop离线day06-Apache Hive SQL-DDL、DML

今日课程学习目标

#掌握HQL DDL建表语句
	理解Hive SerDe机制、分隔符语法
	掌握内外部表、分区表、分桶表创建使用
#理解HQL DDL其他语句
	修改、删除
#学会常见show语法使用
#掌握HQL load、insert加载插入语句

今日课程内容大纲

#1、HQL DDL 数据定义语言 针对表的
 核心：建表语句   直接决定了表和文件之间能否映射成功
 	数据类型
 	SerDe序列化机制
 	分隔符语法
 	内部表、外部表
 	数据存储路径
 	分区表
 	分桶表
  alter修改表

#2、HQL DML 数据操纵语言  针对表的数据
	load加载数据
	insert插入数据   insert+select
		多重插入
		动态分区插入
		数据导出
		
#3、HQL常用的show语法

知识点01：Apache Hive--DDL--概念与语法树介绍

蓝色字体是建表语法的关键字，用于指定某些功能。
[ ]中括号的语法表示可选。
|表示使用的时候，左右语法二选一。
建表语句中的语法顺序要和语法树中顺序保持一致。

知识点02：Apache Hive--DDL--建表语句--表存在忽略异常

IF NOT EXISTS

建表的时候，如果表名已经存在，默认会报错，通过IF NOT EXISTS关键字可以忽略异常。

--第一次创建表
0: jdbc:hive2://node1:10000> create table t_1(id int,name string,age int);
--再次执行
0: jdbc:hive2://node1:10000> create table t_1(id int,name string,age int);
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table t_1 already exists) (state=08S01,code=1)
--Error while processing statement 执行期间的错误 往往就是逻辑错误

--加上if not exists忽略异常
0: jdbc:hive2://node1:10000> create table if not exists t_1(id int,name string,age int);

0: jdbc:hive2://node1:10000> creatf table t_1(id int,name string,age int);
Error: Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'creatf' 'table' 't_1' (state=42000,code=40000)
0: jdbc:hive2://node1:10000> 
--Error while compiling statement 编译期间的错误  SQL语法问题

hivesql -->编译--->执行

知识点03：Apache Hive--DDL--建表语句--数据类型

hive语法手册

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Hive除了支持SQL类型之外，还支持java数据类型；
Hive除了支持基础数据类型之外，还支持复合类型（array数组 map映射）；
- 针对复合类型的数据要想直接从文件中解析成功还必须配合分隔符指定的语法。
Hive中大小写不敏感；
在建表的时候，最好表的字段类型要和文件中的类型保持一致，
- 如果不一致，Hive会尝试进行类型隐式转换，不保证转换成功，如果不成功，显示null;

栗子

--创建数据库并切换使用
create database if not exists itheima;
use itheima;

--建表
create table t_archer(
      id int comment "ID",
      name string comment "英雄名称",
      hp_max int comment "最大生命",
      mp_max int comment "最大法力",
      attack_max int comment "最高物攻",
      defense_max int comment "最大物防",
      attack_range string comment "攻击范围",
      role_main string comment "主要定位",
      role_assist string comment "次要定位"
) comment "王者荣耀射手信息"
row format delimited
fields terminated by "\t";

--查看表数据 查看表的元数据信息
select * from t_archer;
desc formatted t_archer;

--上传文件到表对应的HDFS目录下
[root@node1 hivedata]# hadoop fs -put archer.txt /user/hive/warehouse/itheima.db/t_archer

[root@node1 hiedata]# pwd
/root/hivedata

知识点04：Apache Hive--DDL--建表语句--SerDe机制、分隔符指定语法

机制：SerDe（Serializer and Deserializer）序列化机制
Hive使用SerDe机制读写HDFS上文件

读文件

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object

#1、使用InputFileFormat（默认实现TextInputFormat）读取hdfs上文件
    一行一行读取数据，返回<k，v>键值对类型
#2、返回<key, value>，其中数据存储在value中

#3、使用Deserializer反序列化动作读取value  解析成为对象（Row object）
    默认的序列化类LazysimpleSerDe

写文件
- Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

分隔符指定语法

语法格式

ROW FORMAT DELIMITED | SERDE

ROW FORMAT DELIMITED  表示使用LazySimpleSerDe类进行序列化解析数据
 
ROW FORMAT SERDE      表示使用其他SerDe类进行序列化解析数据

ROW FORMAT DELIMITED具体的子语法

row format delimited
[fields terminated by char]  #指定字段之间的分隔符
[collection items terminated by char]  #指定集合元素之间的分隔符
[map keys terminated by char]         #指定map类型kv之间的分隔符
[lines terminated by char]            #指定换行符

课堂练习sql

create table t_hot_skin_1(
id int,
name string,
win_rate int,
skin_price string
)
row format delimited
fields terminated by ',';


create table t_hot_skin_2(
id int,
name string,
win_rate int,
skin_price map<string,int>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

知识点05：Apache Hive--DDL--建表语句--默认分隔符

默认分隔符
- Hive在建表的时候，如果没有row format语法，则该表使用==\001默认分隔符==进行字段分割；
- 如果此时文件中的数据字段之间的分隔符也是\001 ，那么就可以直接映射成功。
- 针对默认分隔符，其是一个不可见分隔符，在代码层面是\001表示
- 在vim编辑器中，连续输入ctrl+v 、ctrl+a；
- 在实际工作中，Hive最喜欢的就是\001分隔符，在清洗数据的时候，有意识的把数据之间的分隔符指定为\001;

栗子

--建表
create table t_hot_hero_skin_price(
      id int,
      name string,
      win_rate int,
      skin_price map<string,int>
)
row format delimited
fields terminated by ',' --字段之间分隔符
collection items terminated by '-'  --集合元素之间分隔符
map keys terminated by ':'; --集合元素kv之间分隔符;

--上传数据
hadoop fs -put hot_hero_skin_price.txt /user/hive/warehouse/itheima.db/t_hot_hero_skin_price 

select *
from t_hot_hero_skin_price;


--有点想法： 就把map数据类型当成字符串映射如何？
create table t_hot_hero_skin_price_str(
      id int,
      name string,
      win_rate int,
      skin_price string
)
row format delimited
fields terminated by ',';

hadoop fs -put hot_hero_skin_price.txt /user/hive/warehouse/itheima.db/t_hot_hero_skin_price_str 

--结论 
	·1、不管使用map还是使用string来定义数据 都能解析映射成功
	 2、区别在于使用的过程中 一个是针对map类型数据处理  一个是针对string类型的数据处理
	 
	 
select skin_price from t_hot_hero_skin_price;	 
select skin_price from t_hot_hero_skin_price_str;


select skin_price["至尊宝"] from t_hot_hero_skin_price limit 1;	 
select skin_price["至尊宝"] from t_hot_hero_skin_price_str limit 1; --语法错误


--建表 不指定分隔符
create table t_team_ace_player(
   id int,
   team_name string,
   ace_player_name string
); --没有指定row format语句 此时采用的是默认的\001作为字段的分隔符

hadoop fs -put team_ace_player.txt /user/hive/warehouse/itheima.db/t_team_ace_player

select * from t_team_ace_player;

知识点06：Apache Hive--DDL--建表语句--内部表、外部表

--创建内部表
create table student_inner(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

--创建外部表 关键字external
create external table student_external(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

--上传文件到内部表、外部表中
hadoop fs -put students.txt /user/hive/warehouse/itheima.db/student_inner
hadoop fs -put students.txt /user/hive/warehouse/itheima.db/student_external

--好像没啥区别 都能映射成功 数据也都在HDFS上

--针对内部表、外部表 进行drop删除操作
drop table student_inner;  --内部表在删除的时候 元数据和数据都会被删除
drop table student_external; --外部表在删除的时候 只删除元数据  而HDFS上的数据文件不会动

外部表有什么好处

最大的好处是防止误操作删除表的时候 把表的数据一起删除。

可以通过命令去查询表的元数据信息获取表的类型

desc formatted table_name;

MANAGED_TABLE     内部表、受控表(所谓的受控指定是表对应的HDFS上的数据受到我的控制)
EXTERNAL_TABLE    外部表

知识点07：Apache Hive--DDL--建表语句--location存储位置

存储路径由hive.metastore.warehouse.dir 属性指定。默认值是：/user/hive/warehouse

不管是内部表，还是外部表，在HDFS上的路径如下：

/user/hive/warehouse/itcast.db/t_array

/user/hive/warehouse/数据库名.db/表名

栗子

--在建表的时候 可以使用location关键字指定表的路径在HDFS任意位置
create table t_team_ace_player_location(
 id int,
 team_name string,
 ace_player_name string)
 location '/aaa'; --使用location关键字指定本张表数据在hdfs上的存储路径
 
--此时再上传数据 就必须上传到指定的目录下 否则就解析映射失败了

在实际开发中，最好集中维护管理Hive表数据，避免文件在HDFS随意存放。

知识点08：Apache Hive--DDL--建表语句--分区表创建、静态数据加载、分区裁剪

分区表的引入背景

create table t_all_hero(
   id int,
   name string,
   hp_max int,
   mp_max int,
   attack_max int,
   defense_max int,
   attack_range string,
   role_main string,
   role_assist string
)
row format delimited
fields terminated by "\t";

--上传文件
hadoop fs -put archer.txt assassin.txt mage.txt support.txt tank.txt warrior.txt /user/hive/warehouse/itheima.db/t_all_hero

select * from t_all_hero;


--查询role_main主要定位是射手并且hp_max最大生命大于6000的有几个

select count(*) from t_all_hero where role_main="archer" and hp_max >6000;

--思考：上述查询sql底层应该如何去查询数据？ 要不要全表扫描？

--问题：要进行过滤 就需要全表扫描 不全表扫描呢 又得不出正确结果？性能如何优化呢？
--优化要求  如何才能够减少全表扫描 而结果又正确。

分区表的创建

--经过大脑分析 我们认为应该根据角色主定位进行分区 所以分区的字段就是role_main
create table t_all_hero_part(
   id int,
   name string,
   hp_max int,
   mp_max int,
   attack_max int,
   defense_max int,
   attack_range string,
   role_main string,
   role_assist string
)
partitioned by(role_main string)
row format delimited
fields terminated by "\t";



--错误说 分区字段重复了 好家伙
Error: Error while compiling statement: FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns (state=42000,code=10035)


--分区表建表 
create table t_all_hero_part(
   id int,
   name string,
   hp_max int,
   mp_max int,
   attack_max int,
   defense_max int,
   attack_range string,
   role_main string,
   role_assist string
) partitioned by (juesedingwei string)--注意哦 这里是分区字段
row format delimited
fields terminated by "\t";

--查询分区表 发现分区字段也显示出来了
select * from t_all_hero_part;

分区表的数据加载--静态分区加载

--静态加载分区表数据
load data local inpath '/root/hivedata/archer.txt' into table t_all_hero_part partition(juesedingwei='sheshou');

load data local inpath '/root/hivedata/assassin.txt' into table t_all_hero_part partition(juesedingwei='cike');
load data local inpath '/root/hivedata/mage.txt' into table t_all_hero_part partition(juesedingwei='fashi');
load data local inpath '/root/hivedata/support.txt' into table t_all_hero_part partition(juesedingwei='fuzhu');
load data local inpath '/root/hivedata/tank.txt' into table t_all_hero_part partition(juesedingwei='tanke');
load data local inpath '/root/hivedata/warrior.txt' into table t_all_hero_part partition(juesedingwei='zhanshi');

--查询一下验证是否加载成功
select * from t_all_hero_part;

load data local inpath '/root/hivedata/warrior.txt' into table t_all_hero_part partition(juesedingwei='666');

思考：如果分区很多一个一个加载，效率如何？

因为静态分区的时候分区值是用户手动写死的有写错的风险。

知识点09：Apache Hive--DDL--建表语句--动态分区插入数据

设置允许动态分区、设置动态分区模式

--动态分区
set hive.exec.dynamic.partition=true; --注意hive3已经默认开启了
set hive.exec.dynamic.partition.mode=nonstrict;

--模式分为strict严格模式  nonstrict非严格模式
严格模式要求 分区字段中至少有一个分区是静态分区。


--举个栗子
set hive.exec.dynamic.partition.mode=strict;  --设置为严格模式


Error: Error while compiling statement: FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict (state=42000,code=10096)

--说动态分区在严格模式下 要求至少有一个分区字段为静态分区字段  
--如果不想这么做  就必须把模式设置为非严格


t_1(xxxx) partitioned by (month string,day string);

insert into t_1  partition(month="202112",day="18") -- 两个分区字段都是静态
insert into t_1  partition(month="202112",day)
insert into t_1  partition(month,day="18")   --至少一个是静态的
insert into t_1  partition(month,day)  --都是动态的

动态分区加载数据

insert + select

插入的数据来自于后面的查询语句返回的结果。

查询返回的内容，其字段类型、顺序、个数要和待插入的表保持一致。

--创建一张新的分区表 t_all_hero_part_dynamic
create table t_all_hero_part_dynamic(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_max int,
    attack_range string,
    role_main string,
    role_assist string
) partitioned by (role_dong string)
row format delimited
fields terminated by "\t";

--执行动态分区插入  --注意 分区值并没有手动写死指定
insert into table t_all_hero_part_dynamic partition(role_dong) 
select tmp.*,tmp.role_main from t_all_hero tmp; --查询返回10个字段


--查询验证结果
select * from t_all_hero_part_dynamic;

---严格模式下报错信息
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

在执行动态分区插入数据的时候，如果是严格模式strict，要求至少一个分区为静态分区？

partition(guojia="zhongguo",sheng) --第一个分区写死了（静态）符合严格模式。

partition(guojia,sheng) --两个分区都是动态确定的需要非严格模式

分区表的使用--分区裁剪

--非分区表 全表扫描过滤查询
select count(*) from t_all_hero where role_main="archer" and hp_max >6000;

--分区表 先基于分区过滤 再查询
select count(*) from t_all_hero_part where juesedingwei="sheshou" and hp_max >6000;

知识点10：Apache Hive--DDL--建表语句--多重分区及分区表注意事项

分区表注意事项

分区表的字段不能是表中已有的字段；分区的字段也会显示在查询结果上；
分区的字段是虚拟的字段，出现在表所有字段的后面，其值来自于加载数据到表中的时候手动指定。
分区在底层的形式就是以文件夹管理不同的文件；不同文件夹就是表不同分区；文件夹的名字：

/user/hive/warehouse/数据库.db/表
/分区字段=分区值1
xxxx.txt
/分区字段=分区值2
zzzz.txt


- 分区表是一种优化表，建表的时候可以不使用，但是，当==创建分区表之后，使用分区字段查询可以减少全表扫描，提高查询的效率==。

- 企业中常用的分区字段

- 地域维度：省，市
- 时间维度：day,month,year

----

多重分区表

- 分区表支持基于多个字段进行分区

partitioned by(字段1 ，字段2....)


- 多个分区之间是一种==递进关系==，可以理解为在==前一个分区的基础上继续分区==；从底层来说就是文件夹下面继续划分子文件夹；

- ==常见的多分区就是2个分区==；

```sql
--以国家、省创建分区表
create table t_user_double_p(id int,name string,country string) partitioned by(guojia string,sheng string) row format delimited fields terminated by  ',';

--加载数据到多分区表中
load data local inpath '/root/hivedata/china_sh.txt'  into table t_user_double_p partition(guojia="zhongguo",sheng="shanghai");

load data local inpath '/root/hivedata/china_sz.txt'  into table t_user_double_p partition(guojia="zhongguo",sheng="shenzhen");

load data local inpath '/root/hivedata/usa_dezhou.txt'  into table t_user_double_p partition(guojia="meiguo",sheng="dezhou");

--查询来自于中国深圳的用户有哪些？
select * from t_user_double_p where guojia="zhongguo"and sheng="shenzhen";

知识点11：Apache Hive--DDL--建表语句--分桶表语法、创建、加载

从语法层面解析分桶含义

CLUSTERED BY xxx INTO N BUCKETS
--根据xxx字段把数据分成N桶
--根据表中的字段把数据文件成为N个部分

t_user(id int,name string);
--1、根据谁分？
 CLUSTERED BY xxx ；  xxx必须是表中的字段
--2、分成几桶？
 N BUCKETS   ；N的值就是分桶的个数
--3、分桶的规则？
clustered by id into 3 bucket

hashfunc(分桶字段)  %  N bucket  余数相同的来到同一个桶中
1、如果分桶的字段是数字类型的字段，hashfunc(分桶字段)=分桶字段本身
2、如果分桶的字段是字符串或者其他字段，hashfunc(分桶字段) = 分桶字段.hashcode

分桶的创建

CREATE TABLE itheima.t_usa_covid19_bucket(
      count_date string,
      county string,
      state string,
      fips int,
      cases int,
      deaths int)
CLUSTERED BY(state) INTO 5 BUCKETS; --分桶的字段一定要是表中已经存在的字段

--根据state州分为5桶 每个桶内根据cases确诊病例数倒序排序
CREATE TABLE itheima.t_usa_covid19_bucket_sort(
     count_date string,
     county string,
     state string,
     fips int,
     cases int,
     deaths int)
CLUSTERED BY(state)
sorted by (cases desc) INTO 5 BUCKETS;--指定每个分桶内部根据 cases倒序排序

分桶表的数据加载

--step1:开启分桶的功能 从Hive2.0开始不再需要设置
set hive.enforce.bucketing=true;

--step2:把源数据加载到普通hive表中
CREATE TABLE itheima.t_usa_covid19(
       count_date string,
       county string,
       state string,
       fips int,
       cases int,
       deaths int)
row format delimited fields terminated by ",";

--将源数据上传到HDFS，t_usa_covid19表对应的路径下
hadoop fs -put us-covid19-counties.dat /user/hive/warehouse/itheima.db/t_usa_covid19

--step3:使用insert+select语法将数据加载到分桶表中
insert into t_usa_covid19_bucket select * from t_usa_covid19;

select * from t_usa_covid19_bucket limit 10;

分桶表的使用

--基于分桶字段state查询来自于New York州的数据
--不再需要进行全表扫描过滤
--根据分桶的规则hash_function(New York) mod 5计算出分桶编号
--查询指定分桶里面的数据 就可以找出结果  此时是分桶扫描而不是全表扫描
select *
from t_usa_covid19_bucket where state="New York";

知识点12：Apache Hive--DDL--建表语句--分桶表的好处、注意事项

分桶表也是一种优化表，可以**减少join查询时笛卡尔积的数量**、提高抽样查询的效率。
分桶表的字段必须是表中已有的字段；
分桶表需要使用间接的方式才能把数据加载进入：insert+select
在join的时候，针对join的字段进行分桶，可以提高join的效率减少笛卡尔积数量。

知识点13：Apache Hive--DDL--库、表、分区其他操作

因为Hive建表、加载数据及其方便高效；在实际的应用中，如果建表有问题，通常可以直接drop删除重新创建加载数据。时间成本极低。

如果表是外部表的话，更加完美了。

Database 数据库 DDL操作

--创建数据库
create database if not exists itcast
comment "this is my first db"
with dbproperties ('createdBy'='Allen');

--描述数据库信息
describe database itcast;
describe database extended itcast;
desc database extended itcast;

--切换数据库
use default;
use itcast;
create table t_1(id int);

--删除数据库
--注意 CASCADE关键字慎重使用
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
drop database itcast cascade ;


--更改数据库属性
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
--更改数据库所有者
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
--更改数据库位置
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;

Table 表 DDL操作


--下面这两个需要记住
--查询指定表的元数据信息
desc formatted itheima.t_all_hero_part;
show create table t_all_hero_part;

--1、更改表名
ALTER TABLE table_name RENAME TO new_table_name;
--2、更改表属性
ALTER TABLE table_name SET TBLPROPERTIES (property_name = property_value, ... );
--更改表注释
ALTER TABLE student SET TBLPROPERTIES ('comment' = "new comment for student table");
--3、更改SerDe属性
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES (property_name = property_value, ... )];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
--移除SerDe属性
ALTER TABLE table_name [PARTITION partition_spec] UNSET SERDEPROPERTIES (property_name, ... );

--4、更改表的文件存储格式 该操作仅更改表元数据。现有数据的任何转换都必须在Hive之外进行。
ALTER TABLE table_name  SET FILEFORMAT file_format;
--5、更改表的存储位置路径
ALTER TABLE table_name SET LOCATION "new location";

--6、更改列名称/类型/位置/注释
CREATE TABLE test_change (a int, b int, c int);
// First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;
// Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
// The new table's structure is:  b int, a2 string, c int.
// Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
// The new table's structure is:  c1 int, b int, a2 string.
// Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';

--7、添加/替换列
--使用ADD COLUMNS，您可以将新列添加到现有列的末尾但在分区列之前。
--REPLACE COLUMNS 将删除所有现有列，并添加新的列集。
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type,...);

Partition分区 DDL操作

比较重要的是增加分区、删除分区操作

--1、增加分区
--step1: 创建表 手动加载分区数据
drop table if exists t_user_province;
create table t_user_province (
    num int,
    name string,
    sex string,
    age int,
    dept string) partitioned by (province string);

load data local inpath '/root/hivedata/students.txt' into table t_user_province partition(province ="SH");


--step2:手动创建分区的文件夹 且手动上传文件到分区中 绕开了hive操作  发现hive无法识别新分区
hadoop fs -mkdir /user/hive/warehouse/itheima.db/t_user_province/province=XM
hadoop fs -put students.txt /user/hive/warehouse/itheima.db/t_user_province/province=XM


--step3：修改hive的分区，添加一个分区元数据
ALTER TABLE t_user_province ADD PARTITION (province='XM') location
    '/user/hive/warehouse/itheima.db/t_user_province/province=XM';


----此外还支持一次添加多个分区
ALTER TABLE table_name ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
    PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';


--2、重命名分区
ALTER TABLE t_user_province PARTITION (province ="SH") RENAME TO PARTITION (province ="Shanghai");

--3、删除分区
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (dt='2008-08-08', country='us') PURGE; --直接删除数据 不进垃圾桶 有点像skipTrash

--4、修复分区
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
--详细使用见课件资料

--5、修改分区
--更改分区文件存储格式
ALTER TABLE table_name PARTITION (dt='2008-08-09') SET FILEFORMAT file_format;
--更改分区位置
ALTER TABLE table_name PARTITION (dt='2008-08-09') SET LOCATION "new location";

知识点14：Apache Hive--常见的show语法

show databases 数据库

show tables 表

show partitions 表的所有分区注意必须是分区表才可以执行该语法

desc formatted table_name; 查看表的元数据信息

show create table table_name; 获取表的DDL建表语句

show functions; 函数方法

--1、显示所有数据库 SCHEMAS和DATABASES的用法 功能一样
show databases;
show schemas;

--2、显示当前数据库所有表/视图/物化视图/分区/索引
show tables;
SHOW TABLES [IN database_name]; --指定某个数据库

--3、显示当前数据库下所有视图
--视图相当于没有数据临时表  虚拟表
Show Views;
SHOW VIEWS 'test_*'; -- show all views that start with "test_"
SHOW VIEWS FROM test1; -- show views from database test1
SHOW VIEWS [IN/FROM database_name];

--4、显示当前数据库下所有物化视图
SHOW MATERIALIZED VIEWS [IN/FROM database_name];

--5、显示表分区信息，分区按字母顺序列出，不是分区表执行该语句会报错
show partitions table_name;
show partitions itheima.student_partition;

--6、显示表/分区的扩展信息
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE table_name;
show table extended like student;
describe formatted itheima.student;

--7、显示表的属性信息
SHOW TBLPROPERTIES table_name;
show tblproperties student;

--8、显示表、视图的创建语句
SHOW CREATE TABLE ([db_name.]table_name|view_name);
show create table student;

--9、显示表中的所有列，包括分区列。
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name];
show columns  in student;

--10、显示当前支持的所有自定义和内置的函数
show functions;

--11、Describe desc
--查看表信息
desc extended table_name;
--查看表信息（格式化美观）
desc formatted table_name;
--查看数据库相关信息
describe database database_name;

---

知识点15：Apache Hive--DML--load加载数据（注意local含义）

功能：load加载操作是将数据文件移动到与 Hive表对应的位置的纯复制/移动操作.

#1、如何理解这个纯字？
load命令是单纯数据搬运移到动作，移到加载的过程中，不会对数据进行任何修改操作。

#2、有的是移动操作，有的是复制操作

语法

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Local关键字的含义
- local表示是从本地文件系统加载数据到Hive的表中；
- 如果没有local表示的是从HDFS文件系统加载数据到Hive表中；
- 问本地指的是哪个本地？ HS2本地的文件系统

演示

--step1:建表
--建表student_local 用于演示从本地加载数据
create table student_local(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
--建表student_HDFS  用于演示从HDFS加载数据到非分区表
create table student_HDFS(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
--建表student_HDFS_p 用于演示从HDFS加载数据到分区表
create table student_HDFS_p(num int,name string,sex string,age int,dept string) partitioned by(country string) row format delimited fields terminated by ',';

--step2:加载数据
-- 从本地加载数据  数据位于HS2（node1）本地文件系统  本质是hadoop fs -put上传操作
--Loading data to table itheima.student_local from file:/root/hivedata/students.txt
LOAD DATA LOCAL INPATH '/root/hivedata/students.txt' INTO TABLE student_local;

--从HDFS加载数据  数据位于HDFS文件系统根目录下  本质是hadoop fs -mv 移动操作
--Loading data to table itheima.student_hdfs from hdfs://node1:8020/stu/students.txt

--先把数据上传到HDFS上  hadoop fs -put /root/hivedata/students.txt /
LOAD DATA INPATH '/students.txt' INTO TABLE student_HDFS;

----从HDFS加载数据到分区表中并指定分区  数据位于HDFS文件系统根目录下
--先把数据上传到HDFS上 hadoop fs -put /root/hivedata/students.txt /
LOAD DATA INPATH '/students.txt' INTO TABLE student_HDFS_p partition(country ="China");

总结：建议在HIve中使用load命令加载数据到表中，这也是Hive比Mysql加载数据高效地方所在。
其实不管是使用load还是直接使用hadoop fs,目标就是把文件放置在表对应的目录下。

知识点16：Apache Hive--DML--insert插入语法

探讨

--如果使用操作Mysql的思维来玩Hive会如何  使用insert+values的方式插入数据。

create table t_insert(id int,name string);

insert into table t_insert values(1,"allen");

--可以执行 但是效率极低 因为底层是通过MapReduce插入数据的  因此实际中推荐使用load加载数据

insert+select

在hive中，insert主要是结合 select 查询语句使用，将查询结果插入到表中。

保证后面select查询语句返回的结果字段个数、类型、顺序和待插入表一致；
如果不一致，Hive会尝试帮你转换，但是不保证成功；
insert+select也是在数仓中ETL数据常见的操作。

--step1:创建一张源表student
drop table if exists student;
create table student(num int,name string,sex string,age int,dept string)
row format delimited
fields terminated by ',';
--加载数据
load data local inpath '/root/hivedata/students.txt' into table student;

select * from student;

--step2：创建一张目标表  只有两个字段
create table student_from_insert(sno int,sname string);
--使用insert+select插入数据到新表中
insert into table student_from_insert
select num,name from student;

select *
from student_from_insert;

Multi Inserts 多重插入

功能：一次扫描，多次插入

栗子

create table source_table (id int, name string) row format delimited fields terminated by ',';

create table test_insert1 (id int) row format delimited fields terminated by ',';
create table test_insert2 (name string) row format delimited fields terminated by ',';


--普通插入：
insert into table test_insert1 select id from source_table;
insert into table test_insert2 select name from source_table;

--在上述需求实现中 从同一张表扫描了2次 分别插入不同的目标表中 性能低下。

--多重插入：
from source_table                     
insert overwrite table test_insert1 
select id
insert overwrite table test_insert2
select name;
--只需要扫描一次表  分别把不同字段插入到不同的表中即可 减少扫描次数 提高效率

Dynamic partition inserts 动态分区插入

何谓动态分区，静态分区

针对的是分区表。
--问题：分区表中分区字段值是如何确定的？

1、如果是在加载数据的时候人手动写死指定的  叫做静态分区 
load data local inpath '/root/hivedata/usa_dezhou.txt'  into table t_user_double_p partition(guojia="meiguo",sheng="dezhou");

2、如果是通过insert+select 动态确定分区值的，叫做动态分区
insert table partition (分区字段) +select

栗子

--1、首先设置动态分区模式为非严格模式 默认已经开启了动态分区功能
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

--2、当前库下已有一张表student
select * from student;

--3、创建分区表 以sdept作为分区字段
create table student_partition(Sno int,Sname string,Sex string,Sage int) partitioned by(Sdept string);

--4、执行动态分区插入操作
insert into table student_partition partition(Sdept)
select num,name,sex,age,dept from student;
--其中，num,name,sex,age作为表的字段内容插入表中
--dept作为分区字段值

select *
from student_partition;

show partitions student_partition;

知识点17：Apache Hive--DML--insert导出数据操作

功能：把select查询的结果导出成为一个文件。
注意：导出操作是一个overwrite操作，可能会让你凉凉。慎重！！！！
- 致命栗子：insert overwrite directory '/' select num,name,age from student_hdfs limit 2;：根目录文件全部会被删掉=rm -rf /*

语法

--当前库下已有一张表student
select * from student_hdfs;

--1、导出查询结果到HDFS指定目录下
insert overwrite directory '/tmp/hive_export/e1' select num,name,age from student_hdfs limit 2;   --默认导出数据字段之间的分隔符是\001

--2、导出时指定分隔符和文件存储格式
insert overwrite directory '/tmp/hive_export/e2' row format delimited fields terminated by ','
stored as orc
select num,name,age from student_hdfs limit 2;

--3、导出数据到本地文件系统指定目录下
insert overwrite local directory '/root/hive_export/e1' select num,name,age from student_hdfs limit 2;

今日课程作业

#1、Hive DDL 建表语句  重点之重点
	数据类型
	SerDe序列化机制（Hive怎么读写HDFS文件的）
	分隔符指定
	默认分隔符
	内外部表
	存储路径location
	
	分区表
	分桶表
		分区、分桶是什么
		解决了什么问题
		如何创建
		如何加载
		如何使用
		注意事项
#2、Hive DDL Alter
	partition的增加、修改、修复

#3、常见的show
	库  
	表
	分区
	建表语句
	表元数据信息
	函数方法
    
#4、hive DML  重点之重点
	load
		local本地是哪里？
		如何理解load是一个纯操作？纯是什么意思？
		什么时候是复制  什么时候是加载
	insert
		insert+select
		多重插入  优化
		动态分区插入   #这个很重要
			什么叫动态 静态
			动态分区的严格模式  非严格模式是啥
		数据导出操作
			注意overwrite覆盖