Hive，如何按空值列進行磁區，將所有空值放在一個磁區中-有解無憂

我使用的是 Hive，IDE 是 Hue。我正在嘗試為我的磁區鍵選擇不同的組合鍵。

我的原表定義如下：

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_day` string
    )
    PARTITIONED BY ( 
      `date_year` string, 
      `date_month` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

上面的代碼作業正常。但是當我用 date_day 作為磁區鍵創建一個新表時，該表是空的，我需要運行 MSCK 修復表。但是，我收到以下錯誤：

編譯陳述句時出錯：失敗：執行錯誤，從 org.apache.hadoop.hive.ql.ddl.DDLTask 回傳代碼 1

Hive，如何按空值列進行磁區，將所有空值放在一個磁區中

當磁區鍵為 date_year、date_month 時，MSCK 正常作業。

我收到錯誤的表的表定義如下：

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
  )
    PARTITIONED BY (
     `date_day` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

在此之后，以下查詢為空：

Select * From `my_hive_db`.`my_table` Limit 10;

因此，我運行了以下命令：

MSCK REPAIR TABLE `my_hive_db`.`my_table`;

And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

I checked this link as it is exactly the error I am getting, but by using the solution provided:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;

I get a different error:

Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.

I think the reason I am getting these errors is that there are more than 200 million records with date_day having null value.

There are 31 distinct date-day not null values. I would like to partition my table with 32 partitions, each for a distinct value of date_day field, and all the null values get into a different partition. Is there a way to do so (partitioning by a column with null values)?

If this can be achieved by spark, I am also open to use it.

This is part of a bigger problem of changing partition keys by recreating a table as mentioned in this link in answer to my other question.

Thank you for your help.

uj5u.com熱心網友回復：

你似乎不明白 Hive 的磁區是如何作業的。Hive 將資料存盤到 HDFS（或 S3，或其他一些分布式檔案夾）上的檔案中。如果您創建一個名為的非磁區鑲木地板表my_schema.my_table，您將看到在您的分布式存盤檔案夾中存盤的檔案

hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...

如果創建按列磁區的表p_col，檔案將如下所示

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...

該命令MSCK repair table允許您在創建外部表時自動重新加載磁區。

假設您在 s3 上有如下所示的檔案夾：

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet

您創建一個外部表

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_col STRING)

該表將被創建但為空，因為 Hive 尚未檢測到磁區。您運行MSCK REPAIR TABLE my_schema.my_table，Hive 將識別您的磁區p_col與 s3 ( /p_col=value1/)上的磁區方案匹配。

根據我從您的另一個問題中了解到的，您正在嘗試通過執行以下操作來更改表的磁區方案

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_another_col STRING)

并且您收到一條錯誤訊息，因為p_another_col與 s3 中使用的列不匹配，即p_col. 這個錯誤是完全正常的，因為你所做的沒有意義。

如另一個問題的答案所述，您需要使用不同的磁區方案創建第一個表的副本。

你應該嘗試這樣的事情：

CREATE External Table my_hive_db.my_table_2(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
)
PARTITIONED BY (`date_day` string)

然后使用動態磁區填充新表

INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT 
  col_id,
  result_section__col2,
  result_section_col3,
  result_section_col4,
  result_section_col5,
  result_section_col6__label,
  result_section_col7__label_id,
  result_section_text,
  result_section_unit,
  result_section_col,
  result_section_title,
  result_section_title_id,
  col13,
  timestamp,
  date_year,
  date_month,
  date_day
FROM my_hive_db.my_table_1

轉載請註明出處，本文鏈接：https://www.uj5u.com/gongcheng/373060.html

標籤：apache-spark hadoop hive database-partitioning

上一篇：從JAR內部讀取Avroparquet檔案

下一篇：PythonMapReduce長度最長單詞并顯示最長單詞