我使用的是 Hive,IDE 是 Hue。我正在嘗試為我的磁區鍵選擇不同的組合鍵。
我的原表定義如下:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
上面的代碼作業正常。但是當我用 date_day 作為磁區鍵創建一個新表時,該表是空的,我需要運行 MSCK 修復表。但是,我收到以下錯誤:
編譯陳述句時出錯:失敗:執行錯誤,從 org.apache.hadoop.hive.ql.ddl.DDLTask 回傳代碼 1

當磁區鍵為 date_year、date_month 時,MSCK 正常作業。
我收到錯誤的表的表定義如下:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
在此之后,以下查詢為空:
Select * From `my_hive_db`.`my_table` Limit 10;
因此,我運行了以下命令:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link as it is exactly the error I am getting, but by using the solution provided:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with date_day having null value.
There are 31 distinct date-day not null values. I would like to partition my table with 32 partitions, each for a distinct value of date_day field, and all the null values get into a different partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved by spark, I am also open to use it.
This is part of a bigger problem of changing partition keys by recreating a table as mentioned in this link in answer to my other question.
Thank you for your help.
uj5u.com熱心網友回復:
你似乎不明白 Hive 的磁區是如何作業的。Hive 將資料存盤到 HDFS(或 S3,或其他一些分布式檔案夾)上的檔案中。如果您創建一個名為 的非磁區鑲木地板表my_schema.my_table,您將看到在您的分布式存盤檔案夾中存盤的檔案
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
如果創建按列磁區的表p_col,檔案將如下所示
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
該命令MSCK repair table允許您在創建外部表時自動重新加載磁區。
假設您在 s3 上有如下所示的檔案夾:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
您創建一個外部表
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
該表將被創建但為空,因為 Hive 尚未檢測到磁區。您運行MSCK REPAIR TABLE my_schema.my_table,Hive 將識別您的磁區p_col與 s3 ( /p_col=value1/)上的磁區方案匹配。
根據我從您的另一個問題中了解到的,您正在嘗試通過執行以下操作來更改表的磁區方案
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
并且您收到一條錯誤訊息,因為p_another_col與 s3 中使用的列不匹配,即p_col. 這個錯誤是完全正常的,因為你所做的沒有意義。
如另一個問題的答案所述,您需要使用不同的磁區方案創建第一個表的副本。
你應該嘗試這樣的事情:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
然后使用動態磁區填充新表
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
timestamp,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
轉載請註明出處,本文鏈接:https://www.uj5u.com/gongcheng/373060.html
