Hadoop Archives Guide (Archiving Files in HDFS)
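1. Introduction:
Hadoop Archives is a special archive format. One Hadoop archive maps to one file system directory.
The name of a Hadoop archive has the extension *.har.
A Hadoop archive contains metadata files (_index and _masterindex) and data files (part-*).
The _index file holds the names of the archived files and their locations inside the part files.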
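2. Use case:
HDFS may end up holding a large number of small files (not producing small files in the first place is the best practice). They bloat the NameNode namespace, which keeps the inode information of every HDFS file: the more files there are, the more NameNode memory is needed, and that memory is finite (a well-known weak point of Hadoop). The small files therefore need to be merged.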
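3. Structure of a HAR file:
[Figure: structure of a HAR file]
A HAR file is generated by a MapReduce job. The source files are not deleted when the job finishes.
HDFS is not good at storing small files: every file occupies at least one block, and the metadata of every block takes up memory on the NameNode. A large number of small files can therefore eat up a large share of the NameNode's memory.
Hadoop Archives handle this problem effectively. They pack many files into a single archive, every file inside the archive can still be accessed transparently, and the archive can be used as the input of a MapReduce job.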
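The memory argument above can be made concrete with a little arithmetic. The sketch below is only a rough estimate: it assumes the common rule of thumb of about 150 bytes of NameNode heap per namespace object (one object per file plus one per block), and the file and block counts are hypothetical.

```shell
# estimate_namenode_bytes FILE_COUNT [BLOCKS_PER_FILE] [BYTES_PER_OBJECT]
# Rough NameNode heap estimate. Assumption: each file contributes one file
# object plus one object per block, and each object costs about 150 bytes.
estimate_namenode_bytes() {
  local files="$1" blocks_per_file="${2:-1}" bytes_per_object="${3:-150}"
  echo $(( files * (1 + blocks_per_file) * bytes_per_object ))
}

# 10 million small files stored individually (one block each)
estimate_namenode_bytes 10000000   # prints 3000000000 (about 3 GB)

# The same data packed into one part file of 80 blocks (hypothetical size)
estimate_namenode_bytes 1 80       # prints 12150
```

The second figure ignores the _index and _masterindex files, which add only a handful of objects.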

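4. Pros and cons
Probably the only advantage of a Hadoop archive is that it packs many small files into one HAR file, which is then split into blocks according to dfs.block.size. HDFS keeps roughly 150 bytes of metadata per block. A small file here means a file smaller than dfs.block.size, so that each file is its own block. Many such files take up a lot of NameNode heap, and packing them into a HAR file greatly reduces the memory pressure on the NameNode daemon. For MapReduce, however, the archive does not help: a HAR file behaves like a directory, so the small files still cannot be combined into one split. Each small file still gets its own split, which remains inefficient. Note that the Chinese edition of "Hadoop: The Definitive Guide" mistranslates this point: it says the files can be assigned to a single split, only inefficiently.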
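5. Deleting and restoring:
After HDFS files have been archived, the system does not delete the source files automatically; they have to be deleted by hand.
The sources can be deleted with a glob pattern (not a true regular expression); adapt it to your own needs:
hadoop fs -rmr /user/hadoop/xxx/201310/*.*.*
(-rmr is deprecated on newer releases; hadoop fs -rm -r is the current form.)
What if the sources have been deleted, only the archive remains, and the files are needed again? That is simple too: copy them out of the HAR file through the har:// scheme. Without the scheme, the command would copy the archive's internal _index, _masterindex, and part-* files instead of the original files.
hadoop fs -cp har:///user/xxx/201310/201310.har/* /user/hadoop/xxx/201310/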
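The delete and restore steps above can be wrapped in a small script that refuses to delete the sources unless the archive can be listed. This is a minimal sketch, not a hardened tool: the function names and the DRY_RUN switch are made up for illustration, and the paths are the placeholder paths from this section. It reads the archive through har:// so that the original files, rather than the archive's internal _index and part-* files, are listed and copied.

```shell
# Dry-run switch (hypothetical): 1 prints the commands, 0 executes them.
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# delete_sources_if_archived HAR_PATH SRC_DIR
# Removes SRC_DIR only when the archive can be listed through har://.
delete_sources_if_archived() {
  local har="$1" src="$2"
  if run hadoop fs -ls "har://${har}" >/dev/null; then
    run hadoop fs -rm -r "$src"
  else
    echo "archive ${har} is not readable; keeping ${src}" >&2
    return 1
  fi
}

# restore_from_archive HAR_PATH DEST_DIR
# Copies every top-level entry of the archive back to DEST_DIR.
# The glob is quoted so that Hadoop, not the local shell, expands it.
restore_from_archive() {
  local har="$1" dest="$2"
  run hadoop fs -cp "har://${har}/*" "$dest"
}

delete_sources_if_archived /user/xxx/201310/201310.har /user/hadoop/xxx/201310
restore_from_archive /user/xxx/201310/201310.har /user/hadoop/xxx/201310/
```

With DRY_RUN=1 the script only prints the rm and cp commands, which makes it easy to review them before running anything destructive.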
6. How to create an archive:
Original English documentation:
Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>
-archiveName is the name of the archive you would like to create. An example would be foo.har. The name should have a *.har extension. The parent argument is to specify the relative path to which the files should be archived to. Example would be :
-p /foo/bar a/b/c e/f/g
Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to parent. Note that this is a Map/Reduce job that creates the archives. You would need a map reduce cluster to run this. For a detailed example, see the later sections.
If you just want to archive a single directory /foo/bar then you can just use
hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
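In other words: the -archiveName option gives the name of the archive to create, for example foo.har, and the name should have a *.har extension. The inputs are ordinary file system path names, and the archive is written to the destination directory. Creating an archive is a Map/Reduce job, so the command has to be run against a MapReduce cluster. An example: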
hadoop archive -archiveName test_save_foo.har -p /foo/bar a/b/c e/f/g /user/outputdir/
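The command above archives the contents of the two directories a/b/c and e/f/g under /foo/bar into /user/outputdir/. The source files are neither modified nor deleted. Note that a/b/c and e/f/g are both subdirectories of /foo/bar.
The following form is wrong: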
hadoop archive -archiveName test_save_foo.har -p /foo/bar/a/b/c /foo/bar/e/f/g /user/outputdir/
It fails with:
source path /foo/bar/a/b/c is not relative to /foo/bar/e/f/g
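Since every source must be given relative to the -p parent, absolute paths have to be converted before they are passed to hadoop archive. The helper below is a small sketch of that conversion; the name to_relative is made up for illustration.

```shell
# to_relative PARENT ABSOLUTE_PATH
# Prints ABSOLUTE_PATH relative to PARENT, or fails if it is not under PARENT.
to_relative() {
  local parent="${1%/}" path="$2"   # drop one trailing slash from PARENT
  case "$path" in
    "$parent"/*) printf '%s\n' "${path#"$parent"/}" ;;
    *) echo "path $path is not under $parent" >&2; return 1 ;;
  esac
}

to_relative /foo/bar /foo/bar/a/b/c   # prints a/b/c
to_relative /foo/bar/ /foo/bar/e/f/g  # prints e/f/g
```

The two results are exactly the relative sources used in the working command: hadoop archive -archiveName test_save_foo.har -p /foo/bar a/b/c e/f/g /user/outputdir/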
Generating HAR files:
Notes on the archive command
The -p argument is the prefix of the src paths; several src paths can be given.
archive -archiveName NAME -p <parent path> <src>* <dest>
1) A single src directory
hadoop archive -archiveName test_save_foo.har -p /foo/bar/ 419 /user/outputdir/
2) Several src directories
hadoop archive -archiveName test_save_foo.har -p /foo/bar/ 419 510 /user/outputdir/
3) No src path: the parent path itself is archived (/foo/bar/ in this example; /user/outputdir is still the output path). This trick was dug out of the source code.
hadoop archive -archiveName test_save_foo.har -p /foo/bar/ /user/outputdir/
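4) A src path with pattern matching. The example below archives the data in the directories for October, November, and December. This trick was also dug out of the source code.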
hadoop archive -archiveName combine.har -p /foo/bar/2011 1[0-2] /user/outputdir/
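Before running an archive job with a pattern, it can help to preview which directory names the pattern selects. The sketch below uses the shell's own glob matching as a stand-in for Hadoop's; that is an assumption which holds for a simple character class like 1[0-2], but is not guaranteed for every pattern. The function name is hypothetical.

```shell
# matching_months PATTERN
# Prints the two-digit month names (01..12) that match PATTERN.
matching_months() {
  local pattern="$1" m out=""
  for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
    # $pattern is left unquoted on purpose so that case treats it as a glob.
    case "$m" in
      $pattern) out="${out:+$out }$m" ;;
    esac
  done
  printf '%s\n' "$out"
}

matching_months '1[0-2]'   # prints 10 11 12
```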
7. How to look up files
Original English documentation:
The archive exposes itself as a file system layer. So all the fs shell commands in the archives work but with a different URI. Also, note that archives are immutable. So, rename's, deletes and creates return an error. URI for Hadoop Archives is
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided it assumes the underlying filesystem. In that case the URI would look like
har:///archivepath/fileinarchive
In short: an archive is exposed as a file system layer, so every fs shell command works on it, only with a har:// URI. Archives are immutable, so renames, deletes, and creates return an error. When scheme-hostname is left out, the default file system is used and the URI takes the har:///archivepath/fileinarchive form.
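The two URI forms can be assembled mechanically from their parts. The helper below is a small sketch; the function name har_uri is made up, and the host name namenode and port 8020 in the second call are placeholders rather than values from this article.

```shell
# har_uri ARCHIVE_PATH [FILE_IN_ARCHIVE] [SCHEME HOST PORT]
# Builds har://scheme-hostname:port/archivepath/fileinarchive, or the
# har:///archivepath/fileinarchive form when no scheme is given.
har_uri() {
  local archive="$1" file="${2:-}" scheme="${3:-}" host="${4:-}" port="${5:-}"
  local authority=""
  if [ -n "$scheme" ]; then
    authority="${scheme}-${host}:${port}"
  fi
  printf 'har://%s%s%s\n' "$authority" "$archive" "${file:+/$file}"
}

har_uri /user/hadoop/foo.har dir/filea
# prints har:///user/hadoop/foo.har/dir/filea

har_uri /user/hadoop/foo.har dir/filea hdfs namenode 8020
# prints har://hdfs-namenode:8020/user/hadoop/foo.har/dir/filea
```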
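Here is an example. The input of the archive is /dir, a directory that contains the files filea and fileb. The command that archives /dir into /user/hadoop/foo.har is
hadoop archive -archiveName foo.har -p / dir /user/hadoop
To list the files in the archive that was created, run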
hadoop dfs -lsr har:///user/hadoop/foo.har
To view the file filea inside the archive, run
hadoop dfs -cat har:///user/hadoop/foo.har/dir/filea
8. How to use Hadoop Archives in a MapReduce program
Original English documentation: Using Hadoop Archives in MapReduce is as easy as specifying a different input filesystem than the default file system. If you have a hadoop archive stored in HDFS in /user/zoo/foo.har then for using this archive for MapReduce input, all you need is to specify the input directory as har:///user/zoo/foo.har. Since Hadoop Archives is exposed as a file system MapReduce will be able to use all the logical input files in Hadoop Archives as input.
In other words: a MapReduce job reads from a Hadoop archive simply by naming an input file system other than the default one. If a Hadoop archive is stored in HDFS at /user/zoo/foo.har, the MapReduce program can use the path har:///user/zoo/foo.har as its input.
Because Hadoop Archives are exposed as a file system, MapReduce can use all the logical input files in the archive as input.
9. Examples
1) Examples from the original English documentation:
@1)Creating an Archive
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
The above example is creating an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the following file system directory -- /user/zoo/foo.har. Archiving does not delete the input files. If you want to delete the input files after creating the archives (to reduce namespace), you will have to do it on your own.
@2)Looking Up Files
Looking up files in hadoop archives is as easy as doing an ls on the filesystem. After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, to see all the files in the archives you can just run:
hadoop dfs -lsr har:///user/zoo/foo.har/
To understand the significance of the -p argument, let's go through the above example again. If you just do an ls (not lsr) on the hadoop archive using
hadoop dfs -ls har:///user/zoo/foo.har
The output should be:
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
As you can recall the archives were created with the following command
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If we were to change the command to:
hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo
then a ls on the hadoop archive using
hadoop dfs -ls har:///user/zoo/foo.har
would give you
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
Notice that the archived files have been archived relative to /user/ rather than /user/hadoop
10. Hands-on walkthrough
(1) Create the archive
[pluto@hadoop01 tools]$ hadoop archive -archiveName /hartest1.har -p /hartext /hartext1
21/03/18 21:19:56 INFO mapreduce.JobSubmitter: number of splits:1
21/03/18 21:19:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1616033470495_0002
21/03/18 21:19:57 INFO impl.YarnClientImpl: Submitted application application_1616033470495_0002
21/03/18 21:19:57 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1616033470495_0002/
21/03/18 21:19:57 INFO mapreduce.Job: Running job: job_1616033470495_0002
21/03/18 21:20:34 INFO mapreduce.Job: Job job_1616033470495_0002 running in uber mode : false
21/03/18 21:20:34 INFO mapreduce.Job: map 0% reduce 0%
21/03/18 21:21:10 INFO mapreduce.Job: map 100% reduce 0%
21/03/18 21:21:23 INFO mapreduce.Job: map 100% reduce 100%
21/03/18 21:21:24 INFO mapreduce.Job: Job job_1616033470495_0002 completed successfully
21/03/18 21:21:24 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=280
FILE: Number of bytes written=244529
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=453
HDFS: Number of bytes written=288
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=7
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=33564
Total time spent by all reduces in occupied slots (ms)=20786
Total time spent by all map tasks (ms)=33564
Total time spent by all reduce tasks (ms)=10393
Total vcore-seconds taken by all map tasks=33564
Total vcore-seconds taken by all reduce tasks=10393
Total megabyte-seconds taken by all map tasks=51554304
Total megabyte-seconds taken by all reduce tasks=31927296
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=266
Map output materialized bytes=280
Input split bytes=117
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=280
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=1112
CPU time spent (ms)=4740
Physical memory (bytes) snapshot=461152256
Virtual memory (bytes) snapshot=7755964416
Total committed heap usage (bytes)=343932928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=321
File Output Format Counters
Bytes Written=0
(2) Inspect the internal structure of the archive:
[pluto@hadoop01 tools]$ hdfs dfs -ls /hartest1.har
Found 4 items
-rw-r--r-- 3 pluto supergroup 0 2021-03-18 21:21 /hartest1.har/_SUCCESS
-rw-r--r-- 5 pluto supergroup 250 2021-03-18 21:21 /hartest1.har/_index
-rw-r--r-- 5 pluto supergroup 23 2021-03-18 21:21 /hartest1.har/_masterindex
-rw-r--r-- 3 pluto supergroup 15 2021-03-18 21:21 /hartest1.har/part-0
[pluto@hadoop01 tools]$ hdfs dfs -cat /hartest1.har/part-0
dd
dsd
das
asf
(3) List the contents of the HAR file through the har:// file system
[pluto@hadoop01 tools]$ hadoop dfs -ls har:///hartest1.har/*
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
21/03/18 21:47:14 WARN hdfs.DFSClient: DFSInputStream has been closed already
-rw-r--r-- 3 pluto supergroup 7 2021-03-18 21:16 har:///hartest1.har/a.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 21:16 har:///hartest1.har/b.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 21:16 har:///hartest1.har/c.txt
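A quick consistency check on the listings above: the sizes of the three archived files should add up to the size of part-0, which was 15 bytes. The sketch below sums the size column of an ls-style listing with awk. The helper name sum_sizes is made up, and the listing is pasted in from the output above so that the snippet runs without a cluster.

```shell
# sum_sizes: reads `hdfs dfs -ls` style lines on stdin and sums the size
# column (field 5) of regular files, i.e. lines whose mode starts with "-".
sum_sizes() {
  awk '$1 ~ /^-/ { total += $5 } END { print total + 0 }'
}

listing='-rw-r--r-- 3 pluto supergroup 7 2021-03-18 21:16 har:///hartest1.har/a.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 21:16 har:///hartest1.har/b.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 21:16 har:///hartest1.har/c.txt'

printf '%s\n' "$listing" | sum_sizes   # prints 15, the size of part-0
```

Against a live cluster the same function could be fed directly from the ls command instead of the pasted text.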
(4) View the contents of the files in the HAR file through the har:// file system
[pluto@hadoop01 tools]$ hadoop dfs -cat har:///hartest1.har/*
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
21/03/18 21:48:12 WARN hdfs.DFSClient: DFSInputStream has been closed already
dd
dsd
das
asf
(5) Unarchiving a Hadoop Archive
Anything that has been archived also needs a way to be unarchived. The hadoop distcp command does this, as follows:
[pluto@hadoop01 tools]$ hadoop distcp har:/hartest1.har /hartext
har:/hartest1.har : location of the HAR file
/hartext : destination directory
21/03/18 22:06:54 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[har:/hartest1.har], targetPath=/hartext, targetPathExists=true, preserveRawXattrs=false}
21/03/18 22:06:54 WARN hdfs.DFSClient: DFSInputStream has been closed already
21/03/18 22:06:55 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
21/03/18 22:06:55 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
21/03/18 22:06:55 INFO mapreduce.JobSubmitter: number of splits:4
21/03/18 22:06:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1616033470495_0003
21/03/18 22:06:55 INFO impl.YarnClientImpl: Submitted application application_1616033470495_0003
21/03/18 22:06:55 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1616033470495_0003/
21/03/18 22:06:55 INFO tools.DistCp: DistCp job-id: job_1616033470495_0003
21/03/18 22:06:55 INFO mapreduce.Job: Running job: job_1616033470495_0003
21/03/18 22:07:41 INFO mapreduce.Job: Job job_1616033470495_0003 running in uber mode : false
21/03/18 22:07:41 INFO mapreduce.Job: map 0% reduce 0%
21/03/18 22:08:34 INFO mapreduce.Job: map 25% reduce 0%
21/03/18 22:08:37 INFO mapreduce.Job: map 50% reduce 0%
21/03/18 22:08:39 INFO mapreduce.Job: map 100% reduce 0%
21/03/18 22:08:39 INFO mapreduce.Job: Job job_1616033470495_0003 completed successfully
21/03/18 22:08:40 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=492708
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2999
HDFS: Number of bytes written=15
HDFS: Number of read operations=124
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=215959
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=215959
Total vcore-seconds taken by all map tasks=215959
Total megabyte-seconds taken by all map tasks=221142016
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=540
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=3577
CPU time spent (ms)=3350
Physical memory (bytes) snapshot=760733696
Virtual memory (bytes) snapshot=12149698560
Total committed heap usage (bytes)=597164032
File Input Format Counters
Bytes Read=1352
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=15
BYTESEXPECTED=15
COPY=4
[pluto@hadoop01 tools]$ hdfs dfs -ls /hartext
Found 1 items
drwxr-xr-x - pluto supergroup 0 2021-03-18 22:08 /hartext/hartest1.har
[pluto@hadoop01 tools]$ hdfs dfs -ls /hartext/hartest1.har
Found 3 items
-rw-r--r-- 3 pluto supergroup 7 2021-03-18 22:08 /hartext/hartest1.har/a.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 22:08 /hartext/hartest1.har/b.txt
-rw-r--r-- 3 pluto supergroup 4 2021-03-18 22:08 /hartext/hartest1.har/c.txt
Reference: https://blog.csdn.net/helloxiaozhe/article/details/79159799
