我想要的是

我在這種目錄中有 17TB 的日期磁區資料：

/data_folder
  /date=2021.01.01
    /part-00002-f0b91523-6e0c-4adc-88cc-e9451614791d.c000.snappy.parquet
    /part-00002-f0193442-c20e-49d2-bde1-70053ae2a254.c000.snappy.parquet
    /... over 9000 part files 
  /date=2021.01.02
    /part-00002-bdb50c33-fd32-4e87-9edb-cec77973760b.c000.snappy.parquet
    /part-00001-e2cd906e-5669-46d7-92e9-7498ed60487f.c000.snappy.parquet
    /... over 9000 part files

我想讓它看起來像這樣：

/data_folder
  /date=2021.01.01
    /merge.parquet
  /date=2021.01.02
    /merge.parquet

我想要這個是因為我聽說 HDFS 更適合存盤少量的大檔案，而不是大量的小檔案。現在我的查詢變得很慢。希望這個優化能加快他們的速度

我所做的

所以我運行這樣的命令：

hdfs dfs -getmerge /data_folder/date=2021.01.01 merge.parquet;
hdfs dfs -copyFromLocal -f -t 4 merge.parquet /merged/date=2021.01.01/merge.parquet;

我得到了我想要的目錄結構，但現在我無法讀取檔案。查詢：

%spark2.spark

val date = "2021.01.01"


val ofdCheques2Uniq = spark.read
    .parquet(s"/projects/khajiit/data/OfdCheques2/date=$date")
    .withColumn("chequeId", concat($"content.cashboxRegNumber", lit("_"), $"content.number", lit("_"), col("content.timestamp")))
    .dropDuplicates("chequeId")
    
val ofdChequesTempUniq = spark.read
    .parquet(s"/projects/khajiit/data/OfdChequesTemp/date=$date")
    .withColumn("chequeId", concat($"content.cashboxRegNumber", lit("_"), $"content.number", lit("_"), col("content.timestamp")))
    .dropDuplicates("chequeId")

println(s"OfdCheques2   : ${ofdCheques2Uniq.count} unique cheques")
println(s"OfdChequesTemp: ${ofdChequesTempUniq.count} unique cheques")

印刷：

OfdCheques2   : 4309 unique cheques
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 74.0 failed 4 times, most recent failure: Lost task 0.3 in stage 74.0 (TID 1720, srs-st-hdp-s3.dev.kontur.ru, executor 1): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

同時，這樣的查詢：

val ofdCheques2Uniq = spark.read
    .parquet(s"/projects/khajiit/data/OfdCheques2/date=$date")
    
val ofdChequesTempUniq = spark.read
    .parquet(s"/projects/khajiit/data/OfdChequesTemp/date=$date")

println(s"OfdCheques2   : ${ofdCheques2Uniq.count} unique cheques")
println(s"OfdChequesTemp: ${ofdChequesTempUniq.count} unique cheques")

印刷：

OfdCheques2   : 5290 unique cheques
OfdChequesTemp: 18 unique cheques

最后是問題

該getmerge命令是否適用于我的問題？如果是這樣，我做錯了什么？
解決這個問題的最佳方法是什么？

uj5u.com熱心網友回復：

得到了我想要的目錄結構，但現在我無法讀取檔案

這是由于 Parquet 檔案的二進制結構。它們具有存盤模式和檔案中記錄數的頁眉/頁腳元資料......getmerge因此實際上僅對行分隔的非二進制資料格式有用。

您可以做的是擁有spark.read.path("/data_folder"), thenrepartition或coalesce那個資料幀，然后輸出到新的“合并”輸出位置

另一種選擇是 Gobblin - https://gobblin.apache.org/docs/user-guide/Compaction/

轉載請註明出處，本文鏈接：https://www.uj5u.com/ruanti/343955.html

標籤：阿帕奇火花 Hadoop 高密度文件镶木地板

上一篇：Express正在將HTML物體應用于JSON正文

下一篇：在預期決議從S3加載的雪花資料中的列時達到記錄結束

如何在HDFS中合并部分檔案？

我想要的是

我所做的

最后是問題