I have a directory structure similar to this on my HDFS system:
/some/path
├─ report01
│ ├─ file01.csv
│ ├─ file02.csv
│ ├─ file03.csv
│ └─ lot_of_other_non_csv_files
├─ report02
│ ├─ file01.csv
│ ├─ file02.csv
│ ├─ file03.csv
│ ├─ file04.csv
│ └─ lot_of_other_non_csv_files
└─ report03
  ├─ file01.csv
  ├─ file02.csv
  └─ lot_of_other_non_csv_files
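I want to copy all the CSV files to my local system while keeping the directory structure.
I tried hdfs dfs -copyToLocal /some/path/report*, but this copies a lot of unnecessary (and quite large) files that I don't want.
I also tried hdfs dfs -copyToLocal /some/path/report*/file*.csv, but this does not preserve the directory structure, and HDFS complains that the files already exist when it tries to copy the files from the report02 folder.
Is there a way to get only the files matching a specific pattern while still keeping the original directory structure?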
Answer:
Since there doesn't seem to be any solution implemented directly in Hadoop, I ended up writing my own bash script:
#!/bin/bash
# patterns of files to get (extended regular expressions, as used by grep -E)
TO_GET=('\.csv$' '\.png$')
# patterns of files/directories to avoid (extended regular expressions)
TO_AVOID=('_temporary')
# function to join an array by a specified single-character separator:
# usage: join_arr ";" "${array[@]}"
join_arr() {
    local IFS="$1"
    shift
    echo "$*"
}
if (($# != 2))
then
    echo "There should be two parameters (path of the directory to get and destination)." >&2
    exit 1
else
    # ensure that the provided path ends with a slash
    [[ "$1" != */ ]] && path="$1/" || path="$1"
    echo "Path to copy: $path"
    # ensure that the provided destination ends with a slash and append result directory name
    [[ "$2" != */ ]] && dest="$2/" || dest="$2"
    dest="$dest$(basename "$path")/"
    echo "Destination: $dest"
    # get the names of all files matching the patterns, relative to $path
    echo -n "Exploring path to find matching files... "
    readarray -t files < <(hdfs dfs -ls -R "$path" | grep -E -v "$(join_arr "|" "${TO_AVOID[@]}")" | grep -E "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path} + 1))-)
    echo "Done!"
    # check that at least one file was found
    ((${#files[@]} == 0)) && echo "No files matching the pattern."
    # get files one by one
    for file in "${files[@]}"
    do
        path_and_file="$path$file"
        dest_and_file="$dest$file"
        # make sure the directory exists on the local file system
        mkdir -p "$(dirname "$dest_and_file")"
        # get the file in a separate process so that the copies run in parallel
        (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
    done
    # wait for all copies to finish
    wait
fi
You can call the script like this:
$ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"
The script will create a folder_to_get directory in /some/local/path containing all the CSV and PNG files, with the original directory structure preserved.
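To make the resulting layout a bit more concrete, here is a minimal sketch of how a local target path can be derived from an HDFS file path. The helper name local_target is hypothetical (it is not part of the script above), and it strips the prefix with bash parameter expansion rather than the cut call the script uses, so treat it as an illustration of the idea only.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): map an HDFS file path
# to the local path it would be copied to.
# usage: local_target <hdfs_dir> <local_dest> <hdfs_file>
local_target() {
    local path="$1" dest="$2" file="$3"
    # make sure both directories end with a slash
    [[ "$path" != */ ]] && path="$path/"
    [[ "$dest" != */ ]] && dest="$dest/"
    # the copied folder keeps its own name below the destination
    dest="$dest$(basename "$path")/"
    # drop the HDFS directory prefix, keeping only the relative part
    echo "$dest${file#"$path"}"
}

local_target "/some/hdfs/path/folder_to_get" "/some/local/path" \
    "/some/hdfs/path/folder_to_get/report02/file04.csv"
# prints /some/local/path/folder_to_get/report02/file04.csv
```

Because the relative part (report02/file04.csv here) is kept as-is, files with the same name in different report folders no longer collide.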
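Note: if you want to get file types other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to filter out directories you don't want to scan, even if they contain CSV or PNG files.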
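If you want to try out new patterns before running them against a cluster, something like the following sketch may help. It applies the same two-stage grep filtering to a made-up listing shaped like the output of hdfs dfs -ls -R. The patterns are handed to grep -E, so they are extended regular expressions rather than shell globs. The JSON and archive patterns, the sample listing and the helper names filter_listing and sample_listing are all assumptions chosen for illustration.

```shell
#!/bin/bash
# Illustrative patterns (extended regular expressions, not shell globs).
TO_GET=('\.csv$' '\.json$')
TO_AVOID=('_temporary' '/archive/')

# join an array by a single-character separator
join_arr() {
    local IFS="$1"
    shift
    echo "$*"
}

# Hypothetical helper: read `hdfs dfs -ls -R`-style lines on stdin and print
# the paths (last field) that pass both filters.
filter_listing() {
    grep -E -v "$(join_arr "|" "${TO_AVOID[@]}")" \
        | grep -E "$(join_arr "|" "${TO_GET[@]}")" \
        | awk '{print $NF}'
}

# Made-up listing, shaped like the output of `hdfs dfs -ls -R`.
sample_listing() {
    cat <<'EOF'
drwxr-xr-x   - user group          0 2022-09-01 10:00 /some/path/report01
-rw-r--r--   3 user group       1024 2022-09-01 10:00 /some/path/report01/file01.csv
-rw-r--r--   3 user group       2048 2022-09-01 10:00 /some/path/report01/meta.json
-rw-r--r--   3 user group       4096 2022-09-01 10:00 /some/path/report01/plot.png
-rw-r--r--   3 user group        512 2022-09-01 10:00 /some/path/report01/_temporary/part.csv
-rw-r--r--   3 user group        512 2022-09-01 10:00 /some/path/archive/old.csv
EOF
}

sample_listing | filter_listing
```

With these patterns, only file01.csv and meta.json under report01 are printed: the PNG is not in TO_GET, and the two remaining CSV files sit in paths excluded by TO_AVOID.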
