Hive部署及簡單測驗
- Hive部署
- 下載
- 解壓安裝
- 上傳
- 解壓
- 安裝MySQL
- 修改Hive配置
- 放入Hive所需的MySQL驅動依賴包
- 配置Hive環境變數
- Hive首次啟動
- 啟動HDFS
- 啟動YARN
- 啟動Hive
- 創建HDFS目錄
- 初始化Hive元資料
- 正式啟動
- Hive實作wordcount
- 創建表
- 加載HDFS資料
- SQL分析處理
- Hive統計二手房
- 準備資料
- 創建Hive表
- 加載資料
- 加載本地資料
- 加載HDFS檔案
- SQL分析處理
- 關閉集群
- 關閉Hive
- 關閉HDFS
- 關閉YARN
Hive概述
Hive部署
筆者選擇部署Hive2.1.0,其他版本大同小異,在之前虛擬機的基礎上進行安裝,選擇安裝在node3,
下載
在Apache Hive的官網,也就是hive.apache.org下載安裝包,由于是外網,可能需要科學上網,
或者直接傳送門,筆者手頭已經有個舊版本的包了,就懶得下載了,一樣用,,,

解壓安裝
上傳
在node3:
cd /export/software/
rz
解壓
tar -zxvf apache-hive-2.1.0-bin.tar.gz -C /export/server/
cd /export/server/
mv apache-hive-2.1.0-bin hive-2.1.0-bin
為了使用方便,改個名,
安裝MySQL
由于Hive的元資料是保存在MySQL的,故需要安裝MySQL,筆者的node3已經安裝過了,

修改Hive配置
默認配置一般都不是想要的狀態,,,
cd /export/server/hive-2.1.0-bin/conf/
mv hive-env.sh.template hive-env.sh
vim hive-env.sh
在49行插入:
HADOOP_HOME=/export/server/hadoop-2.7.5

在52行插入:
export HIVE_CONF_DIR=/export/server/hive-2.1.0-bin/conf

使用:wq保存退出,
使用 ll -ah 查看當前目錄下的檔案:

手動改haive-default-xml.template的名稱及內容,筆者直接在win搞個hive-site.xml,內容先改成:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
</configuration>
在configuration夾層中添加:
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node3:3306/hivemetadata?createDatabaseIfNotExist=true&useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node3</value>
</property>
然后rz上傳也可以實作同樣的效果:
放入Hive所需的MySQL驅動依賴包
筆者選用5.3.18的jar包,rz上傳后:
cp /export/software/mysql-connector-java-5.1.38.jar /export/server/hive-2.1.0-bin/lib/
cd /export/server/hive-2.1.0-bin/
ll ./lib
確認lib目錄下已經有該檔案,
配置Hive環境變數
vim /etc/profile
找個空白處:
#HIVE_HOME
export HIVE_HOME=/export/server/hive-2.1.0-bin
export PATH=:$PATH:$HIVE_HOME/bin

筆者直接放最后了,,,記得保存,
Hive首次啟動
首次啟動需要做一些配置,之后啟動就可以省去這些步驟,

筆者已經有一些命令了(忘了哪些是自己寫的shell腳本,,,讀者如果沒有,可以查閱筆者之前的稿子),
啟動HDFS
start-dfs.sh
啟動YARN
start-yarn.sh
啟動Hive
首次啟動需要初始化元資料,
之后啟動HIVE不再需要初始化元資料!!!
創建HDFS目錄
HIVE會自動把資料“搬”到HDFS的倉庫目錄,
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
在node1:50070可以看到:



路徑/user/hive/warehouse用作Hive中所有表的資料在HDFS中的存盤目錄,
初始化Hive元資料
cd /export/server/hive-2.1.0-bin/
ll -ah ./bin
可以看到可用的命令:

使用:
bin/schematool -dbType mysql -initSchema
末尾會出現:
schemaTool completed
可以在MySQL中使用:
show databases;
看到:

說明Hive的MySQL資料庫創建完成,
正式啟動
cd /export/server/hive-2.1.0-bin/bin/
hive
啟動后:

啟動成功!!!
Hive實作wordcount
之前使用MapReduce寫個wordcount得花費好長時間,,,還得琢磨K2、V2、K3、V3,,,事實上,Hive可以使用SQL陳述句自動生成MapReduce程式,,,雖然MapReduce已經接近淘汰,并且Hive宣告未來的高本吧不再支持MapReduce,,,
創建表
在啟動Hive的命令列:

create table tb_word(
words string
);
由于Hive底層是Java,元資料是MySQL,故Hive的表資料可以直接使用Java的資料型別,
在啟動Hive的命令列:

show tables;
可以看出表格已經被創建,
加載HDFS資料
還是使用之前的那個wc.txt試試效果:

使用:
load data inpath '/wordcount/input/wc.txt' into table tb_word;
Hive的命令列會顯示:
hive> load data inpath '/wordcount/input/wc.txt' into table tb_word;
Loading data to table default.tb_word
OK
Time taken: 0.553 seconds
執行完畢后:

原來的資料wc.txt就被“搬”走了!!!

檔案被搬到了HDFS檔案系統的/user/hive/warehouse/tb_word目錄下,
SQL分析處理
在啟動了Hive的命令列分2次執行:
create table tb_word2 as select explode(split(words," ")) as word from tb_word;
select word,count(*) as numb from tb_word2 group by word order by numb desc;
開始運算:

在node3:8088可以看到:

等待完成:

Hive例外機智呢!!!MapReduce程式顯然是在Map階段尚未完成就開始執行Reduce,比之前筆者寫的MqpReduce高效多了,,,
Total MapReduce CPU Time Spent: 41 seconds 370 msec
OK
spark 17280000
hue 17280000
16700000
hbase 11520000
hadoop 10848000
hive 5760000
Time taken: 56.245 seconds, Fetched: 6 row(s)
這老爺機,,,廉頗老矣,,,
創建表是為了實作HDFS檔案與表的映射,SQL + 函式是實作計算處理需求,
hive> create table tb_word2 as select explode(split(words," ")) as word from tb_word;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210501224748_d6a9f3dc-083d-45d4-bfae-5ba64722871d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1619878300652_0001, Tracking URL = http://node3:8088/proxy/application_1619878300652_0001/
Kill Command = /export/server/hadoop-2.7.5/bin/hadoop job -kill job_1619878300652_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2021-05-01 22:48:00,140 Stage-1 map = 0%, reduce = 0%
2021-05-01 22:48:36,022 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 45.57 sec
2021-05-01 22:48:38,155 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 48.58 sec
2021-05-01 22:49:05,266 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 70.35 sec
MapReduce Total cumulative CPU time: 1 minutes 10 seconds 350 msec
Ended Job = job_1619878300652_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://node1:8020/user/hive/warehouse/.hive-staging_hive_2021-05-01_22-47-48_807_5979069376130498884-1/-ext-10002
Moving data to directory hdfs://node1:8020/user/hive/warehouse/tb_word2
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 70.35 sec HDFS Read: 380025735 HDFS Write: 363356169 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 10 seconds 350 msec
OK
Time taken: 78.89 seconds
hive> select word,count(*) as numb from tb_word2 group by word order by numb desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210501224918_0fe4faa9-c8ba-4933-9190-c3b87922bd41
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1619878300652_0002, Tracking URL = http://node3:8088/proxy/application_1619878300652_0002/
Kill Command = /export/server/hadoop-2.7.5/bin/hadoop job -kill job_1619878300652_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 2
2021-05-01 22:49:28,605 Stage-1 map = 0%, reduce = 0%
2021-05-01 22:49:48,103 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 19.28 sec
2021-05-01 22:49:52,403 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 24.98 sec
2021-05-01 22:50:02,052 Stage-1 map = 67%, reduce = 8%, Cumulative CPU 34.27 sec
2021-05-01 22:50:04,150 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 36.92 sec
2021-05-01 22:50:05,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 39.34 sec
MapReduce Total cumulative CPU time: 39 seconds 340 msec
Ended Job = job_1619878300652_0002
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1619878300652_0003, Tracking URL = http://node3:8088/proxy/application_1619878300652_0003/
Kill Command = /export/server/hadoop-2.7.5/bin/hadoop job -kill job_1619878300652_0003
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-05-01 22:50:12,568 Stage-2 map = 0%, reduce = 0%
2021-05-01 22:50:13,635 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.03 sec
MapReduce Total cumulative CPU time: 2 seconds 30 msec
Ended Job = job_1619878300652_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 2 Cumulative CPU: 39.34 sec HDFS Read: 363374386 HDFS Write: 349 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.03 sec HDFS Read: 8680 HDFS Write: 589741 SUCCESS
Total MapReduce CPU Time Spent: 41 seconds 370 msec
OK
spark 17280000
hue 17280000
16700000
hbase 11520000
hadoop 10848000
hive 5760000
Time taken: 56.245 seconds, Fetched: 6 row(s)
明明只有2條SQL陳述句,喂貓執行了3個MapReduce程式?
第一行SQL是拆分欄位建立臨時表并將臨時表插入到新表,
第二行SQL是先分組計數,再按照統計的個數來降序排序,如果自己寫MapReduce程式,也得先寫個分組器完成計數,再將第一次計算的結果進行排序,
Hive能完成的把SQL陳述句轉換為MapRedue功能,人工完全可以做,且做法基本一致,Hive只是起到了自動生成代碼的功能,著實是個很強大的工具,
Hive統計二手房
之前統計二手房資料的csv檔案也是畫了很多時間寫MapReduce,事實上使用Hive很容易就能實作,
準備資料
由于HDFS是分布式的,可以在node1中:
hdfs dfs -mkdir -p /export/data/
創建個目錄,
cd /export/data/
利用rz上傳這個檔案:

上傳到HDFS中:
hdfs dfs -put /export/data/secondhouse.csv /export/data
在node1:50070可以看到:

上傳成功,
創建Hive表
在啟動了Hive的命令列使用:
create table tb_house(
xiaoqu string,
huxing string,
area double,
region string,
floor string,
fangxiang string,
t_price int,
s_price int,
buildinfo string
) row format delimited fields terminated by ',';
加載資料
加載本地資料
在啟動了Hive的命令列使用:
load data local inpath '/export/data/secondhouse.csv' into table tb_house;
這種方式是加載本地檔案,由于此時node3的前臺程式Hive占用了命令列,筆者想偷個懶,之前是上傳到了node1的該目錄,當然是訪問不到的,,,會報錯:
hive> load data local inpath '/export/data/secondhouse.csv' into table tb_house;
FAILED: SemanticException Line 1:23 Invalid path ''/export/data/secondhouse.csv'': No files matching path file:/export/data/secondhouse.csv
先在node1的命令列:
cd /export/data/
scp -r secondhouse.csv node3:$PWD
這句話一定要注意PWD是大寫!!!不信可以在閑置的node2使用echo $pwd 和 echo $PWD分別查看,PWD小寫是不能顯示內容的!!!會導致分發路徑有誤!!!
再次加載本地資料即可成功,寫local就是標注是從Linux本地檔案中讀取,
加載HDFS檔案
由于之前已經上傳到了HDFS中,在啟動了Hive的命令列使用:
load data inpath '/export/data/secondhouse.csv' into table tb_house;
即可直接加載資料,不寫local,就是默認情況,從HDFS加載,
這2種方式均可,但是考慮到大資料的資料量都不會小,當然是優先使用加載HDFS檔案!!!小檔案測驗使用加載本地檔案的方式要方便些,
SQL分析處理
在啟動了Hive的命令列(當然還是node3)使用:
select
region,
count(*) as numb,
round(avg(s_price),2) as avgprice,
max(s_price) as maxprice,
min(s_price) as minprice
from tb_house
group by region;
這次資料量很小,計算很快:

有了Hive這個大殺器,實作MapReduce還是很便捷的,
關閉集群
關閉Hive
在啟動了Hive的命令列(當然還是node3)使用:
quit;
一定要有分號!!!
關閉HDFS
stop-dfs.sh
關閉YARN
stop-yarn.sh
由于沒有使用ZooKeeper,jps查看行程也沒有它,基本結束了所有行程,先留個快照再關閉,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/282591.html
標籤:其他
