I am trying to use AWS Glue to convert a 20 GB gzipped JSON file to Parquet.
I have set up a job using PySpark with the code below.
I get this log warning message:
LOG.WARN: Loading one large unsplittable file s3://aws-glue-data.json.gz with only one partition, because the file is compressed by unsplittable compression codec.
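I would like to know whether there is a way to split/chunk the file. I know I can do this with pandas, but unfortunately that takes far too long (more than 12 hours).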
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import pyspark.sql.functions
from pyspark.sql.functions import col, concat, reverse, translate
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
test = glueContext.create_dynamic_frame_from_catalog(
    database="test_db",
    table_name="aws-glue-test_table")
# Drop the timestamp field and rename the other fields (still a Glue DynamicFrame)
reconfigure = (test.drop_fields(['timestamp'])
    .rename_field('name', 'FirstName')
    .rename_field('LName', 'LastName')
    .rename_field('type', 'record_type'))
# Create pyspark DF
spark_df = reconfigure.toDF()
# Filter and only return 'a' record types
spark_df = spark_df.where("record_type == 'a'")
# Once filtered, remove the record_type column
spark_df = spark_df.drop('record_type')
spark_df = spark_df.withColumn("LastName", translate("LastName", "LName:", ""))
spark_df = spark_df.withColumn("FirstName", reverse("FirstName"))
spark_df.write.parquet("s3a://aws-glue-bucket/parquet/test.parquet")
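Answer:
Spark does not read a single gzip file in parallel, because gzip is not a splittable compression codec. However, you can split the file into chunks.
In addition, Spark is very slow at reading gzip files (because the read is not parallelized). You can do the following to speed it up: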
import gzip
# list_of_files: paths of the chunked .gz files; sc: the active SparkContext
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda path: gzip.open(path).readlines())
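The answer says the file can be split into chunks but does not show how. The following is a minimal sketch of one way to do it with the Python standard library. It assumes the JSON is newline-delimited (one record per line) and that the archive has already been copied to a local path. The function name `split_gzip_by_lines`, the `part-00000.json.gz` naming scheme, and the default chunk size are hypothetical choices for illustration and do not come from the original post.

```python
import gzip
import itertools
import os


def split_gzip_by_lines(src_path, out_dir, lines_per_chunk=1_000_000):
    """Stream one large .gz file and rewrite it as several smaller .gz files.

    Assumes line-oriented input (for example JSON Lines), so that cutting
    on line boundaries never splits a record.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    # Binary mode keeps the bytes of every line unchanged.
    with gzip.open(src_path, "rb") as src:
        for index in itertools.count():
            # Pull at most lines_per_chunk lines; only one chunk is held in memory.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = os.path.join(out_dir, f"part-{index:05d}.json.gz")
            with gzip.open(chunk_path, "wb") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

The returned list could serve as `list_of_files` in the snippet above, provided the chunk paths are readable from the Spark executors. Decompressing the original archive is still a single sequential pass, but it only has to happen once, and each resulting chunk can then be read by a separate task.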