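HDFS Data Safety and Basic Use of the Java API
- HDFS data safety
- Metadata safety
- How metadata is generated
- Metadata storage
- SecondaryNameNode
- Basic use of the Java API
- Use cases
- Configuration
- Maven configuration
- Local development environment
- Starting the cluster
- Starting ZooKeeper
- Starting HDFS
- Starting YARN
- Building the connection
- Releasing resources
- Getting cluster information
- Creating and listing directories
- Upload and download
- Merged upload
- Permissions
- Shutting down the cluster
- Stopping HDFS
- Stopping YARN
- Stopping ZooKeeper
- Powering off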
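Read these two posts first:
ZooKeeper overview
HDFS overview
HDFS data safety
Metadata safety
How metadata is generated
The on-disk metadata files are created when HDFS is formatted. On node1, run: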
cd /export/server/hadoop-2.7.5/hadoopDatas/namenodeDatas/current/
to change to that directory, then list it with ll -ah:
[root@node1 current]# cd /export/server/hadoop-2.7.5/hadoopDatas/namenodeDatas/current/
[root@node1 current]# ll -ah
total 24K
drwxr-xr-x 2 root root 222 Apr 25 23:12 .
drwxr-xr-x 3 root root 40 Apr 25 21:36 ..
-rw-r--r-- 1 root root 0 Apr 25 23:12 edits.xml
-rw-r--r-- 1 root root 3.3K Apr 25 21:56 fsimage_0000000000000000501
-rw-r--r-- 1 root root 62 Apr 25 21:56 fsimage_0000000000000000501.md5
-rw-r--r-- 1 root root 3.5K Apr 25 22:56 fsimage_0000000000000000519
-rw-r--r-- 1 root root 62 Apr 25 22:56 fsimage_0000000000000000519.md5
-rw-r--r-- 1 root root 0 Apr 25 23:10 fsimage.xml
-rw-r--r-- 1 root root 4 Apr 25 22:56 seen_txid
-rw-r--r-- 1 root root 203 Apr 25 21:36 VERSION
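These fsimage files are the metadata files.
Metadata storage
Metadata lives in memory maintained by the NameNode. On disk there is also the fsimage file (created when HDFS is first formatted and used to persist the metadata), which is loaded into memory when the NameNode starts. The NameNode reads and writes metadata constantly, though. If the metadata lived only in a file on disk, read and write performance would be terrible; if it lived only in memory, a crash and restart would lose whatever was held there.
That is why the edits file exists: every change to the in-memory metadata is recorded in the edits file. On restart after a crash, the NameNode merges the fsimage file with the edits file to rebuild the previous state. It works a bit like an incremental save, or a snapshot plus a change log.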
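To make the "snapshot plus change log" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop code: the class EditLogReplayDemo, the Edit type, and the recover method are hypothetical names invented for this illustration, and the "namespace" is just a sorted set of paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of metadata recovery (hypothetical names; not Hadoop code). */
final class EditLogReplayDemo {

    /** One logged namespace change: create or delete a path. */
    static final class Edit {
        final boolean create;
        final String path;

        Edit(boolean create, String path) {
            this.create = create;
            this.path = path;
        }
    }

    /** Rebuilds the in-memory namespace from a snapshot plus the edits logged after it. */
    static TreeSet<String> recover(List<String> fsimage, List<Edit> edits) {
        TreeSet<String> namespace = new TreeSet<>(fsimage); // load the snapshot
        for (Edit e : edits) {                              // replay the changes in order
            if (e.create) {
                namespace.add(e.path);
            } else {
                namespace.remove(e.path);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        List<String> fsimage = Arrays.asList("/tmp", "/user");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(true, "/bigdata"));  // mkdir /bigdata
        edits.add(new Edit(false, "/tmp"));     // rm -r /tmp
        System.out.println(recover(fsimage, edits)); // prints [/bigdata, /user]
    }
}
```

The point of the sketch is only the order of operations: load the snapshot first, then replay the logged changes one by one.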
SecondaryNameNode
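If the cluster stays up for a long time, the edits file grows very large. Because it records changes, a lot of old records that no longer matter take up disk space, and on startup the NameNode has to replay every change from the very beginning, even though many of those steps are clearly redundant.
This is where the SecondaryNameNode comes in: it periodically merges the fsimage file and the edits file into a new, up-to-date fsimage. The next time the NameNode starts, it only needs to load the latest fsimage plus a small amount of edits to restore the metadata quickly.
A cluster runs fine without a SecondaryNameNode, but it starts more and more slowly. In practice, since HA mode is normally used to protect the data, the idle NameNode (in Standby state) usually takes over the SecondaryNameNode's job.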
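The effect of a checkpoint can be sketched the same way. This is again a toy model in plain Java with hypothetical names (CheckpointDemo, log, checkpoint, pendingEdits), and the checkpoint is triggered by hand; in real HDFS the trigger is governed by settings such as dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a checkpoint (hypothetical names; not Hadoop code). */
final class CheckpointDemo {
    private TreeSet<String> fsimage;                          // last persisted snapshot
    private final List<String[]> edits = new ArrayList<>();   // {"+" or "-", path} since the snapshot

    CheckpointDemo(List<String> initialImage) {
        this.fsimage = new TreeSet<>(initialImage);
    }

    /** Records one change in the edit log. */
    void log(String op, String path) {
        edits.add(new String[] {op, path});
    }

    /** Number of edits a restart would have to replay right now. */
    int pendingEdits() {
        return edits.size();
    }

    /** Folds all pending edits into a new snapshot, then empties the log. */
    void checkpoint() {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (String[] e : edits) {
            if ("+".equals(e[0])) {
                merged.add(e[1]);
            } else {
                merged.remove(e[1]);
            }
        }
        fsimage = merged;
        edits.clear();
    }

    /** Returns a copy of the current snapshot. */
    TreeSet<String> image() {
        return new TreeSet<>(fsimage);
    }

    public static void main(String[] args) {
        CheckpointDemo nn = new CheckpointDemo(Arrays.asList("/tmp"));
        nn.log("+", "/bigdata");
        nn.log("+", "/user");
        nn.log("-", "/tmp");
        System.out.println(nn.pendingEdits());                    // prints 3
        nn.checkpoint();
        System.out.println(nn.pendingEdits() + " " + nn.image()); // prints 0 [/bigdata, /user]
    }
}
```

After the checkpoint, a restart would load the new snapshot and replay nothing, which is the speed-up described above.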
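Basic use of the Java API
Use cases
The command-line client is generally used for administrative operations. Large-scale reads and writes cannot realistically be done by typing commands by hand; when large volumes of data have to be read and written, the work has to be automated in code.
Typically a distributed computing program wraps the HDFS Java API, and reads and writes of HDFS data go through that program.
Configuration
Maven configuration
Add the dependencies to the new project's pom.xml: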
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.13</version>
</dependency>
</dependencies>
Pin the compiler level to JDK 1.8:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
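Finally, remember to put log4j.properties into the resources directory.
Local development environment
First set up the Windows 10 environment variables:

I put the Hadoop package on drive C, so I created a new HADOOP_HOME variable with the value C:\Program Files\bigdatastudy\hadoop2.7.5.
Then add a new entry to Path: C:\Program Files\bigdatastudy\hadoop2.7.5\bin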
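A quick way to confirm that the variable is visible to Java is a small check like the one below. This is a minimal sketch: the class HadoopHomeCheck and its check method are hypothetical, and the winutils.exe test is an assumption that is not stated in this post (Hadoop on Windows normally expects that file under the bin directory of HADOOP_HOME).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sanity check for the HADOOP_HOME setup (hypothetical helper; not part of Hadoop). */
final class HadoopHomeCheck {

    /** Describes whether the given HADOOP_HOME value looks usable on Windows. */
    static String check(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        Path bin = Paths.get(hadoopHome, "bin");
        if (!Files.isDirectory(bin)) {
            return "missing directory: " + bin;
        }
        // Assumption: the Windows build needs winutils.exe in %HADOOP_HOME%\bin.
        if (!Files.isRegularFile(bin.resolve("winutils.exe"))) {
            return "missing winutils.exe in " + bin;
        }
        return "OK";
    }

    public static void main(String[] args) {
        // Reads the variable the same way any JVM started after the change would see it.
        System.out.println(check(System.getenv("HADOOP_HOME")));
    }
}
```

Environment variables are read when a process starts, so the IDE usually has to be restarted before a new HADOOP_HOME value shows up.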

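Starting the cluster
My cluster had gone down (the ens33 network interface disappeared), so I had no choice but to reboot. Resuming straight from the suspended state was no longer possible, and everything had to be started again.
Starting ZooKeeper
On all 3 virtual machines, change directory with cd /export/server/zookeeper-3.4.6/.
Run
bin/zkServer.sh status
to check ZooKeeper's state. If it is not running, run the following on node1 (and likewise on node2 and node3, since the script only starts the server on the machine where it is run):
bin/zkServer.sh start
This starts the ZooKeeper service. Then check the status again: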
[root@node1 zookeeper-3.4.6]# bin/zkServer.sh status
JMX enabled by default
Using config: /export/server/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower
[root@node2 zookeeper-3.4.6]# bin/zkServer.sh status
JMX enabled by default
Using config: /export/server/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
[root@node3 zookeeper-3.4.6]# bin/zkServer.sh status
JMX enabled by default
Using config: /export/server/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower
node1 and node3 are now followers and node2 is the leader, so the state is normal.
Starting HDFS
On node1, run:
start-dfs.sh
to start HDFS:
[root@node1 zookeeper-3.4.6]# start-dfs.sh
Starting namenodes on [node1]
node1: starting namenode, logging to /export/server/hadoop-2.7.5/logs/hadoop-root-namenode-node1.out
node3: starting datanode, logging to /export/server/hadoop-2.7.5/logs/hadoop-root-datanode-node3.out
node2: starting datanode, logging to /export/server/hadoop-2.7.5/logs/hadoop-root-datanode-node2.out
node1: starting datanode, logging to /export/server/hadoop-2.7.5/logs/hadoop-root-datanode-node1.out
Starting secondary namenodes [node1]
node1: starting secondarynamenode, logging to /export/server/hadoop-2.7.5/logs/hadoop-root-secondarynamenode-node1.out
Starting YARN
On node1, run:
start-yarn.sh
to start YARN:
[root@node1 zookeeper-3.4.6]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /export/server/hadoop-2.7.5/logs/yarn-root-resourcemanager-node1.out
node2: starting nodemanager, logging to /export/server/hadoop-2.7.5/logs/yarn-root-nodemanager-node2.out
node3: starting nodemanager, logging to /export/server/hadoop-2.7.5/logs/yarn-root-nodemanager-node3.out
node1: starting nodemanager, logging to /export/server/hadoop-2.7.5/logs/yarn-root-nodemanager-node1.out
Check the processes with jps on all 3 machines:
[root@node1 zookeeper-3.4.6]# jps
2000 NameNode
2560 NodeManager
2704 Jps
1830 QuorumPeerMain
2138 DataNode
2301 SecondaryNameNode
[root@node2 zookeeper-3.4.6]# jps
2066 NodeManager
1956 DataNode
1852 QuorumPeerMain
2189 Jps
[root@node3 zookeeper-3.4.6]# jps
2160 DataNode
2393 Jps
2013 QuorumPeerMain
2270 NodeManager
Starting the cluster is a chore too. It looks like a one-command startup shell script is worth writing.
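Building the connection
When creating the new objects, make sure to import the right packages (the Hadoop ones, for example org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path).

First declare the file system connection object:
FileSystem fs = null;
Then build the connection instance:
@Before
public void getFS() throws Exception {
//build the Configuration object; every Hadoop program needs one to manage all of its settings
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://node1:8020");
//build the file system instance
fs = FileSystem.get(conf);//configuration only; it must contain the server address
//fs = FileSystem.get(new URI("hdfs://node1:8020"),conf);//configuration plus server address
//fs = FileSystem.get(new URI("hdfs://node1:8020"),conf,"root");//configuration, server address, and user identity
}
This step can rely on configuration files, or each setting can be made by hand with conf.set().
If the configuration files do not contain the server address, or a Linux user identity has to be forced, use one of the last two forms (by default the configuration files are used and the user identity is the current Windows user).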
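Releasing resources
Every test builds a new object. To keep unreleased resources from causing trouble after the program ends (ports staying occupied, for example), write the last step first:
@After
public void closeFS() throws IOException {
//fs is still null if getFS() failed, so guard the call
if (fs != null) {
fs.close();
}
}
The test code that follows then runs between @Before and @After.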
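Outside of JUnit, the same cleanup can be expressed with try-with-resources, because Hadoop's FileSystem implements java.io.Closeable. The sketch below shows only the pattern: FakeFs is a hypothetical stand-in class so that the example runs without a cluster, and CloseOrderDemo and run are illustrative names as well.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

/** Demonstrates that try-with-resources closes the resource on success and on failure. */
final class CloseOrderDemo {
    static final List<String> events = new ArrayList<>();

    /** Hypothetical stand-in for FileSystem; it only records what happens to it. */
    static final class FakeFs implements Closeable {
        void mkdirs(String path) {
            if (path.isEmpty()) {
                throw new IllegalArgumentException("empty path");
            }
            events.add("mkdirs " + path);
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    static void run(String path) {
        try (FakeFs fs = new FakeFs()) { // closed automatically, like the @After method
            fs.mkdirs(path);
        }
    }

    public static void main(String[] args) {
        run("/bigdata");
        try {
            run(""); // mkdirs throws, close() still runs before the exception propagates
        } catch (IllegalArgumentException e) {
            events.add("failed");
        }
        System.out.println(events); // prints [mkdirs /bigdata, close, close, failed]
    }
}
```

With a real FileSystem the try header would hold the FileSystem.get(...) call instead of new FakeFs().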
Getting cluster information
//print the status of every DataNode
@Test
public void printDNinfo() throws IOException {
//cluster management calls need the distributed file system object
DistributedFileSystem dfs = (DistributedFileSystem) this.fs;
//fetch the DataNode statistics
DatanodeInfo[] dataNodeStats = dfs.getDataNodeStats();
//iterate and print
for (DatanodeInfo dataNodeStat : dataNodeStats) {
System.out.println("dataNodeStat.getDatanodeReport() = " + dataNodeStat.getDatanodeReport());
}
}
After running:
dataNodeStat.getDatanodeReport() = Name: 192.168.88.9:50010 (node1)
Hostname: node1
Decommission Status : Normal
Configured Capacity: 37688381440 (35.10 GB)
DFS Used: 1134592 (1.08 MB)
Non DFS Used: 3649200128 (3.40 GB)
DFS Remaining: 34038046720 (31.70 GB)
DFS Used%: 0.00%
DFS Remaining%: 90.31%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sun Apr 25 21:59:54 CST 2021
dataNodeStat.getDatanodeReport() = Name: 192.168.88.10:50010 (node2)
Hostname: node2
Decommission Status : Normal
Configured Capacity: 37688381440 (35.10 GB)
DFS Used: 1134592 (1.08 MB)
Non DFS Used: 3030994944 (2.82 GB)
DFS Remaining: 34656251904 (32.28 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sun Apr 25 21:59:54 CST 2021
dataNodeStat.getDatanodeReport() = Name: 192.168.88.11:50010 (node3)
Hostname: node3
Decommission Status : Normal
Configured Capacity: 37688381440 (35.10 GB)
DFS Used: 1134592 (1.08 MB)
Non DFS Used: 3136466944 (2.92 GB)
DFS Remaining: 34550779904 (32.18 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.67%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sun Apr 25 21:59:54 CST 2021
Process finished with exit code 0
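It looks like the restart after the crash did no real harm.
Creating and listing directories
//create a directory, then list
@Test
public void mkdirAndList() throws Exception{
//build the path object for the directory to create
Path path = new Path("/bigdata");
//check whether the directory exists
if(fs.exists(path)){
//if it does, delete it first (recursively)
fs.delete(path,true);
}
//create it
fs.mkdirs(path);
//list the status of the files/directories under the root
FileStatus[] fileStatuses = fs.listStatus(new Path("/"));
//iterate and print
for (FileStatus fileStatus : fileStatuses) {
System.out.println("fileStatus.getPath().toString() = " + fileStatus.getPath().toString());
}
}
After running: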
fileStatus.getPath().toString() = hdfs://node1:8020/bigdata
fileStatus.getPath().toString() = hdfs://node1:8020/tmp
fileStatus.getPath().toString() = hdfs://node1:8020/user
fileStatus.getPath().toString() = hdfs://node1:8020/wordcount
Process finished with exit code 0
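Open node1:50070 in a browser:

The directory was created under the local machine's user name. To avoid problems, the other ways of creating the object shown in the connection section are sometimes used instead.
Another way to traverse all the files is an iterator:
//listFiles only returns files, not directories
RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/tmp"), true);
while (listFiles.hasNext()) {
System.out.println("listFiles.next().getPath().toString() = " + listFiles.next().getPath().toString());
}
After running: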
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0001-1619180910263-root-word+count-1619180974197-3-1-SUCCEEDED-default-1619180919881.jhist
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0001.summary
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0001_conf.xml
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0002-1619181906959-root-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D2.7.5%2Dtests.jar-1619181942320-10-1-SUCCEEDED-default-1619181915720.jhist
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0002.summary
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0002_conf.xml
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0003-1619182045083-root-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D2.7.5%2Dtests.jar-1619182068279-10-1-SUCCEEDED-default-1619182049875.jhist
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0003.summary
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1619179579492_0003_conf.xml
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0001/node3_39678
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0002/node1_45723
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0002/node2_38036
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0002/node3_39678
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0003/node1_45723
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0003/node2_38036
listFiles.next().getPath().toString() = hdfs://node1:8020/tmp/logs/root/logs/application_1619179579492_0003/node3_39678
Process finished with exit code 0
Why use a remote iterator here? Its advantages were explained in an earlier post.

The result is clearly correct.
Upload and download
//upload and download files
@Test
public void uploadAndDownload() throws Exception {
//upload: put a local file into HDFS
Path localPath1 = new Path("file:///E:\\bigdata\\hello.txt");
Path hdfsPath1 = new Path("/bigdata");
fs.copyFromLocalFile(localPath1,hdfsPath1);
//download: copy an HDFS file to the local disk
Path localPath2 = new Path("file:///E:\\bigdata");
Path hdfsPath2 = new Path("/tmp/logs/root/logs/application_1619179579492_0001/node3_39678");
fs.copyToLocalFile(hdfsPath2,localPath2);
}

The upload succeeded!

The download succeeded too!
Merged upload
This feature merges small files into one large file for storage:

Run the code:
//merge small files while uploading
@Test
public void mergeFile() throws IOException {
//get the local file system, used to open all the files to be merged
LocalFileSystem local = FileSystem.getLocal(new Configuration());
//build one HDFS output stream; this creates the target file
FSDataOutputStream outputStream = fs.create(new Path("/bigdata/merge.txt"));
//iterate over the local files
FileStatus[] fileStatuses = local.listStatus(new Path("E:\\bigdata\\merge"));
for (FileStatus fileStatus : fileStatuses) {
//open each file as an input stream
FSDataInputStream inputStream = local.open(fileStatus.getPath());
//copy the input stream into the output stream
IOUtils.copyBytes(inputStream,outputStream,4096);
//close the input stream
inputStream.close();
}
//loop finished, close the output stream
outputStream.close();
//close the local file system
local.close();
}
Afterwards you can see:

The contents have been merged!
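The same stream-concatenation idea can be tried without a cluster. The sketch below is a local-only analogue built on java.nio rather than the Hadoop API: LocalMergeDemo and merge are hypothetical names, the files live in temporary directories, and sorting the parts by name is a choice made here for predictable output (the HDFS version above simply follows the order returned by listStatus).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Local-only analogue of the merged upload (hypothetical helper; no Hadoop involved). */
final class LocalMergeDemo {

    /** Concatenates every regular file in dir, sorted by name, into target. */
    static void merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts); // fixed order so the result is reproducible
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // stream one small file into the shared output
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge");
        Files.write(dir.resolve("a.txt"), "hello\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.txt"), "world\n".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("merged", ".txt");
        merge(dir, target);
        // prints "hello" and "world" on separate lines
        System.out.print(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Both streams are opened in try-with-resources, so they are closed even if a copy fails halfway through.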
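Permissions
HDFS enables permission checking by default, but earlier this setting:
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
was used to turn permission management off. (In Hadoop 2.x the current name of this key is dfs.permissions.enabled; dfs.permissions is its deprecated alias.)
fs = FileSystem.get(new URI("hdfs://node1:8020"),conf,"root")
This form simply impersonates the root user for the operations.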
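Shutting down the cluster
To avoid a repeat of the earlier failure, I no longer suspend the machines. A proper shutdown seems safer.
Stopping HDFS
On node1, run:
stop-dfs.sh
Stopping YARN
On node1, run:
stop-yarn.sh
Even if YARN already seems to be down after stopping HDFS, run this anyway to be safe (stop-dfs.sh itself only stops the HDFS daemons), then use jps to confirm that the processes are gone.
Stopping ZooKeeper
This step can actually be skipped, since ZooKeeper has to be started after every boot anyway.
cd /export/server/zookeeper-3.4.6/
bin/zkServer.sh stop
Powering off
On all 3 machines:
poweroff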
Please credit the source when reposting. Link to this post: https://www.uj5u.com/qita/280635.html
