Big Data: Hive

Hive

Author: Lijb

Email:

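Introduction to Hive:

  Hive is a data warehouse tool built on Hadoop. It can be used for data extraction, transformation, and loading (ETL), and it provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive maps structured data files to database tables, offers simple SQL query capability, and converts SQL statements into MapReduce jobs for execution.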

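Introduction to ETL:

What is ETL (Extract-Transform-Load):

1. ETL describes the process of moving data from a source through extraction (extract), transformation (transform), and loading (load) into a destination. The term is most often used in data warehousing, where ETL tooling also serves to cleanse data.
2. Implementing ETL starts with implementing the transformation step, which covers the following aspects:
	1. Null handling: detect null field values, then load them as-is or replace them with data of another meaning; rows can also be routed to different target databases depending on whether a field is null.
	2. Normalizing data formats: define format constraints on fields, so that time, numeric, and character data from the source is loaded in a custom format.
	3. Splitting data: decompose a field according to business needs. For example, the caller number 861082585313-8148 can be split into an area code and a phone number.
	4. Validating data: use Lookup together with splitting to verify data. For example, after the caller number 861082585313-8148 is split into an area code and a phone number, Lookup can return the caller's region as recorded by the originating gateway or switch, which is then checked against the area code.
	5. Data replacement: replace invalid or missing data as business rules require.
	6. Lookup: recover missing data. Lookup performs a subquery and returns missing fields obtained by other means, keeping the record complete.
	7. Enforcing primary and foreign key constraints in the ETL process: illegal data with no matching dependency can be replaced or exported to an error file, so that only records with a unique primary key are loaded.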
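
A few of the transformations listed above can be sketched in Python. This is only an illustration and is not part of Hive: the function names, the lookup table, and the assumed layout of the caller number (2-digit country code, 2-digit area code, subscriber number, then "-extension") are all hypothetical.

```python
from typing import Optional


def fill_null(value: Optional[str], default: str = "UNKNOWN") -> str:
    """Null handling: replace a missing or empty field with a default."""
    return default if value is None or value.strip() == "" else value


def split_caller(caller: str) -> dict:
    """Data splitting: break a caller number into its parts.

    Assumed layout (hypothetical): 2-digit country code, 2-digit area
    code, subscriber number, then an optional "-extension" suffix.
    """
    number, _, extension = caller.partition("-")
    return {
        "country_code": number[:2],
        "area_code": number[2:4],
        "subscriber": number[4:],
        "extension": extension,
    }


def validate_area(parts: dict, lookup: dict) -> bool:
    """Validation via lookup: check the area code against a reference table."""
    return parts["area_code"] in lookup


# Hypothetical lookup table mapping area codes to regions.
AREA_LOOKUP = {"10": "Beijing", "21": "Shanghai"}

parts = split_caller("861082585313-8148")
print(parts, validate_area(parts, AREA_LOOKUP), fill_null(""))
```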

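Hive architecture

The Driver is the core engine built into Hive. It consists of three components: Compiler, Optimizer, and Executor.

Compiler: compiles the HQL we write into native MapReduce programs.
Optimizer: optimizes the compiled program before the MapReduce job is submitted for execution. In practice this mostly means adding parameters or configuration strategies to the job.
Executor: hands the optimized MapReduce program to Hadoop's ResourceManager for processing.
			Summary: compile, optimize, then submit to Hadoop's YARN component.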
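Ways to access Hive: CLI / JDBC (ODBC) / Web (GUI)
CLI: operate Hive through client commands.
JDBC: operate Hive from an application or SQL management tool over JDBC.
Web: access Hive through a web interface.
Thrift server: provides the serialization layer for remote calls. The JDBC client and Hive may run on different nodes, so objects must be serialized for cross-node communication.
Metastore: the component that stores the metadata of Hive tables.
		Metadata: the table name, its columns and partitions with their properties, table properties (for example, whether the table is external), the directory where the table's data is stored, and so on.
derby: Hive ships with an embedded Derby database for the Metastore. It supports only a single session, so the metadata cannot be shared.
mysql: the metadata can be shared across sessions.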
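
The kind of information the Metastore keeps for each table can be pictured as a small record. The Python sketch below is only a mental model: the class and field names are hypothetical and do not reflect the Metastore's actual schema. The table name and location are taken from the external-table example in these notes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TableMetadata:
    """Illustrative stand-in for what the Metastore records about a table."""
    name: str
    columns: List[Tuple[str, str]]            # (column name, type)
    partition_keys: List[Tuple[str, str]] = field(default_factory=list)
    external: bool = False                    # managed table unless stated
    location: str = ""                        # data directory on HDFS


meta = TableMetadata(
    name="t_user_a",
    columns=[("id", "int"), ("name", "string"), ("sex", "boolean")],
    external=True,
    location="/user/hive/warehouse/baizhi.db/t_user_c",
)
print(meta.name, meta.external, meta.location)
```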

Installing Hive

Install MySQL and enable remote access
Install the Hadoop environment and start it
Install Hive

[root ~]# tar -zxf apache-hive-1.2.1-bin.tar.gz -C /usr/
[root ~]# vi /usr/apache-hive-1.2.1-bin/conf/hive-site.xml

```
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://CentOS:3306/hive</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>root</value>
  </property>
</configuration>
```
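
Before starting Hive it can help to confirm that the four metastore connection properties are present. The following is a minimal Python sketch using only the standard library. The helper names `read_hive_site` and `missing_properties` are hypothetical, and the sample XML is a deliberately incomplete copy of the configuration above so that the check has something to report.

```python
import xml.etree.ElementTree as ET

REQUIRED = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]


def read_hive_site(xml_text: str) -> dict:
    """Return the name -> value pairs from a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def missing_properties(props: dict) -> list:
    """List the required metastore properties that are not configured."""
    return [name for name in REQUIRED if name not in props]


# Incomplete on purpose: driver name and password are left out.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://CentOS:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
</configuration>"""

print(missing_properties(read_hive_site(SAMPLE)))
```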

mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
mysql> show create database hive;
+----------+---------------------------------------------------------------+
| Database | Create Database                                               |
+----------+---------------------------------------------------------------+
| hive     | CREATE DATABASE hive /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+---------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> quit
Bye
[root ~]# cp mysql-connector-java-5.1.46.jar /usr/apache-hive-1.2.1-bin/lib/
[root ~]# cd /usr/apache-hive-1.2.1-bin/
[root@centos apache-hive-1.2.1-bin]# cp lib/jline-2.12.jar /usr/hadoop-2.6.0/share/hadoop/yarn/lib/
[root@centos apache-hive-1.2.1-bin]# rm -rf /usr/hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar

Starting Hive (standalone | administrator)

[root@centos apache-hive-1.2.1-bin]# ./bin/hive
Logging initialized using configuration in jar:file:/usr/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>

Multi-user access

[root@centos apache-hive-1.2.1-bin]# ./bin/hiveserver2    # server side
[root@centos apache-hive-1.2.1-bin]# ./bin/beeline -u jdbc:hive2://CentOS:10000 -n root
Connecting to jdbc:hive2://CentOS:10000
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1 by Apache Hive

Table operations in Hive

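Table types

Hive tables fall into three types: managed tables, external tables, and partitioned tables.

Managed tables: tables created with a plain create table statement (without the external keyword) are managed tables, sometimes also called internal tables. Hive controls the lifecycle of their data: dropping the table also deletes its data files on HDFS. For that reason managed tables are inconvenient for sharing data with other tools.
External tables: compared with a managed table, the create statement carries the extra keyword external, which tells Hive the table is external, while the location clause tells Hive which path the data lives under. Because Hive does not fully own the data of an external table, dropping the table does not delete the data under the path given by location.
Partitioned tables: both managed and external tables can declare partitions at creation time; such tables are called partitioned tables. Partitioning is a long-established database concept and takes many forms. It is typically used to spread load horizontally, to move data physically closer to its most frequent users, and for other purposes.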
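
The difference in drop behavior between managed and external tables can be mimicked with a local directory standing in for HDFS. This is a toy Python model under that assumption; `drop_table` is a hypothetical helper and does not call Hive.

```python
import os
import shutil
import tempfile


def drop_table(data_dir: str, external: bool) -> bool:
    """Simulate DROP TABLE; return True if the data directory still exists.

    A managed table owns its data, so the directory is deleted with it.
    An external table only loses its metadata; the files stay in place.
    """
    if not external:
        shutil.rmtree(data_dir)
    return os.path.isdir(data_dir)


managed_dir = tempfile.mkdtemp()
external_dir = tempfile.mkdtemp()
managed_kept = drop_table(managed_dir, external=False)
external_kept = drop_table(external_dir, external=True)
print(managed_kept, external_kept)
shutil.rmtree(external_dir)   # clean up the stand-in directory
```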

External tables

1,zhangsan,true,18,15000,TV|Game,001>建设|002>招商,china|bj
2,lisi,true,28,15000,TV|Game,001>建设|002>招商,china|bj
3,wangwu,false,38,5000,TV|Game,001>建设|002>招商,china|sh
------
create external table t_user_c(
id int,
name string,
sex boolean,
age int,
salary double,
hobbies array<string>,
card map<string,string>,
address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n';
create external table t_user_a(
id int,
name string,
sex boolean,
age int,
salary double,
hobbies array<string>,
card map<string,string>,
address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
location '/user/hive/warehouse/baizhi.db/t_user_c';

Partitioned tables

create external table t_user(
id int,
name string,
sex boolean,
age int,
salary double,
country string,
city string
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n';
--------------------
create external table t_user_p(
id int,
name string,
sex boolean,
age int,
salary double
)
partitioned by(country string,city string)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n';
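
Each combination of partition values is stored in its own key=value subdirectory under the table directory. The Python sketch below derives that path. The helper name `partition_path` is hypothetical, and the table directory used in the example is an assumption based on the warehouse path that appears elsewhere in these notes.

```python
import posixpath


def partition_path(table_dir: str, partition: dict) -> str:
    """Build the directory for one partition, keys in declared order."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return posixpath.join(table_dir, *parts)


path = partition_path(
    "/user/hive/warehouse/baizhi.db/t_user_p",    # assumed table location
    {"country": "china", "city": "bj"},
)
print(path)
```

Because the partition columns live in the directory name rather than in the data files, a query that filters on country and city only needs to read the matching directories.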

Hive on HBase

CREATE EXTERNAL TABLE t_user_hbase(
id    string,
name    string,
age  int,
salary    int,
company string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:name,cf1:age,cf1:salary,cf1:company')
TBLPROPERTIES ('hbase.table.name' = 'zpark:t_user');
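
The hbase.columns.mapping string pairs up with the Hive columns by position: the first entry maps to the first column and so on, with :key denoting the HBase row key. A small Python sketch of that pairing follows; `map_columns` is a hypothetical helper, not part of the storage handler.

```python
def map_columns(hive_columns, mapping: str) -> dict:
    """Pair Hive columns with entries of hbase.columns.mapping by position."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    return dict(zip(hive_columns, entries))


pairs = map_columns(
    ["id", "name", "age", "salary", "company"],
    ":key,cf1:name,cf1:age,cf1:salary,cf1:company",
)
print(pairs)
```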

Data types

① primitive types:

Integer: TINYINT, SMALLINT, INT, BIGINT

Boolean: BOOLEAN

Floating point: FLOAT, DOUBLE

String: STRING, CHAR, VARCHAR

Binary: BINARY

Date/time: TIMESTAMP, DATE

② array type: ARRAY<data_type>

③ map (key-value) type: MAP<primitive_type, data_type>

④ struct type: STRUCT<col_name:data_type, ...>

Creating tables and mapping data

create table t_user01(id int, name varchar(40));
-- append the rows of a local file to the table
load data local inpath '/root/user1.log' into table t_user;

-- replace the table's current contents with a local file
load data local inpath '/root/user2.log' overwrite into table t_user2;
create table t_order(id int, name varchar(32), num int, price double, tags array<string>, user_id int);

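Default delimiters

Delimiter	Description
\n	For text files, each line is one record, so \n separates records.
^A (Ctrl+A)	Separates fields (columns); written as \001 in create table.
^B (Ctrl+B)	Separates the elements of an array or struct, or the key-value pairs of a map; written as \002 in create table.
^C (Ctrl+C)	Separates the key from the value within a map entry; written as \003 in create table.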
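
To make the default delimiters concrete, the following Python sketch builds one text line the way a table with default delimiters would be expected to store it. The `encode_row` helper and the sample values are hypothetical; only the control characters \001, \002, and \003 come from the list above.

```python
FIELD_SEP = "\x01"       # ^A, separates columns
ITEM_SEP = "\x02"        # ^B, separates array/struct elements and map entries
KEY_SEP = "\x03"         # ^C, separates a map key from its value


def encode_row(id_, name, hobbies, card):
    """Serialize an (int, string, array, map) row with the default delimiters."""
    hobbies_text = ITEM_SEP.join(hobbies)
    card_text = ITEM_SEP.join(f"{k}{KEY_SEP}{v}" for k, v in card.items())
    return FIELD_SEP.join([str(id_), name, hobbies_text, card_text]) + "\n"


line = encode_row(1, "zhangsan", ["TV", "Game"], {"001": "a", "002": "b"})
print(repr(line))
```

Since these control characters rarely occur in real text, they need no escaping, but they are awkward to type by hand, which is why custom delimiters are common for hand-prepared files.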

create table t_user(
id int,
name string,
sex boolean,
birthDay date,
salary double,
hobbies array<string>,
card map<string,string>,
address struct<country:string,city:string>
);
0: jdbc:hive2://CentOS:10000> desc formatted t_user;

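Preparing the data

Loading data into the table

# rarely used: copy the file directly into the table's directory on HDFS
[root@centos ~]# hdfs dfs -put t_user /user/hive/warehouse/baizhi.db/t_user
# load the data through Hive
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_user' overwrite into table t_user;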

Custom delimiters (more commonly used)

1,zhangsan,true,18,15000,TV|Game,001>建设|002>招商,china|bj
2,lisi,true,28,15000,TV|Game,001>建设|002>招商,china|bj
3,wangwu,false,38,5000,TV|Game,001>建设|002>招商,china|sh
------
create table t_user_c(
    id int,
    name string,
    sex boolean,
    age int,
    salary double,
    hobbies array<string>,
    card map<string,string>,
    address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n';
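
How one of the sample lines is split under this row format can be approximated in Python. The `parse_user` helper is hypothetical and only mimics the delimiter rules (',' between fields, '|' between collection items, '>' between map keys and values); it is not Hive's actual SerDe.

```python
def parse_user(line: str) -> dict:
    """Split one t_user_c text line using the delimiters from the DDL."""
    id_, name, sex, age, salary, hobbies, card, address = line.rstrip("\n").split(",")
    country, city = address.split("|")            # struct fields, in declared order
    return {
        "id": int(id_),
        "name": name,
        "sex": sex == "true",
        "age": int(age),
        "salary": float(salary),
        "hobbies": hobbies.split("|"),            # array<string>
        "card": dict(item.split(">", 1) for item in card.split("|")),  # map<string,string>
        "address": {"country": country, "city": city},
    }


row = parse_user("1,zhangsan,true,18,15000,TV|Game,001>建设|002>招商,china|bj")
print(row["hobbies"], row["card"], row["address"]["city"])
```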

Regex matching

192.168.0.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.2.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.1 xx com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.202.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.2.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.2 xx com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.2 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.2.4 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.4 xx com.xx.xx.XxxService#xx 2018-10-10 10:10:00
---------------------
create table t_access(
    ip string,
    app varchar(32),
    service string,
    last_time string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex"="^(.*)\\s(.*)\\s(.*)\\s(.*\\s.*)"
);
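
The pattern can be tried outside Hive before creating the table. In the Python sketch below the regex is the same one as in input.regex, written with single backslashes because a raw string needs no extra escaping. Each capture group corresponds to one column, in order. Python's re module stands in for Java's regex engine here, which is an assumption that holds for this simple pattern; `parse_access` is a hypothetical helper.

```python
import re

# Same pattern as input.regex; the four groups map to ip, app, service, last_time.
ACCESS_PATTERN = re.compile(r"^(.*)\s(.*)\s(.*)\s(.*\s.*)")


def parse_access(line: str):
    """Return (ip, app, service, last_time), or None if the line does not match."""
    match = ACCESS_PATTERN.match(line)
    return match.groups() if match else None


print(parse_access("192.168.0.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00"))
```

The last group contains its own \s so that the date and the time, which are separated by a space, end up together in last_time.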

CSV files

1,zhangsan,TRUE,20
2,zhangsan,TRUE,21
3,zhangsan,TRUE,22
4,zhangsan,TRUE,23
------------------
CREATE TABLE my_table(
    id int,
    name string,
    sex boolean,
    age int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"escapeChar" = "\\"
);
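
The two SerDe properties have direct counterparts in Python's csv module, which makes it easy to preview how a file will be tokenized. This is a sketch for intuition only, and `read_csv_rows` is a hypothetical helper. Every value comes back as a string here; Hive's documentation describes OpenCSVSerde as treating all columns as strings as well, so the declared int and boolean types may not behave as expected without a cast.

```python
import csv
import io


def read_csv_rows(text: str, separator: str = ",", escape: str = "\\"):
    """Tokenize CSV text with a separator and escape character."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, escapechar=escape)
    return [row for row in reader]


rows = read_csv_rows("1,zhangsan,TRUE,20\n2,zhangsan,TRUE,21\n")
print(rows)
```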

JSON data

{"id":1,"name":"zhangsan","sex":true,"birth":"1991-02-08"}
{"id":2,"name":"lisi","sex":true,"birth":"1991-02-08"}
---------------------
-- register the JsonSerDe jar for this session (DELETE JAR with the same path unregisters it)
ADD JAR /usr/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
create table t_user_json(
    id int,
    name varchar(32),
    sex boolean,
    birth date
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
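
Each line of the input is an independent JSON object whose keys match the column names. The following is a minimal Python sketch of that one-object-per-line convention; `parse_json_lines` is a hypothetical helper, not the JsonSerDe itself.

```python
import json
from datetime import date


def parse_json_lines(text: str):
    """Parse newline-delimited JSON and convert 'birth' to a date."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        record = json.loads(line)
        record["birth"] = date.fromisoformat(record["birth"])
        rows.append(record)
    return rows


SAMPLE = (
    '{"id":1,"name":"zhangsan","sex":true,"birth":"1991-02-08"}\n'
    '{"id":2,"name":"lisi","sex":true,"birth":"1991-02-08"}\n'
)
users = parse_json_lines(SAMPLE)
print(users[0]["name"], users[0]["birth"])
```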

Importing/exporting data (administrator)

INSERT OVERWRITE DIRECTORY '/logs/' ROW FORMAT delimited fields terminated by ',' SELECT ip, app, last_time from t_access;
0: jdbc:hive2://CentOS:10000> create table t(id string ,name string);
0: jdbc:hive2://CentOS:10000> INSERT into t(id,name) values(1,'zs'),(2,'ww');
0: jdbc:hive2://CentOS:10000> insert into table t select ip,last_time from t_access;
0: jdbc:hive2://CentOS:10000> create table temp1 as select ip,last_time from t_access;

Refresh MySQL privileges (on Linux): flush privileges;

Start MySQL: service mysqld start

Log in to MySQL: mysql -uroot -p

Select a database: use test

List tables: show tables

...