我有一個 JSON 檔案中的資料(只顯示第一行)
{
"cd_created_date": "2021-10-05T21:33:39.480933",
"cd_jurisdiction": "PROBATE",
"cd_last_modified": "2021-10-05T21:35:04.061105",
"cd_last_state_modified_date": "2021-10-05T21:35:04.060968",
"cd_latest_state": "WillWithdrawn",
"cd_reference": 1633469619443286,
"cd_security_classification": "PUBLIC",
"cd_version": 7,
"ce_case_data_id": 3483511,
"ce_case_type_id": "WillLodgement",
"ce_case_type_version": 170,
"ce_created_date": "2021-10-05T21:33:51.189872",
"ce_data": "{\"willDate\": \"1950-01-01\", \"jointWill\": \"Yes\", \"lodgedDate\": \"1970-03-03\", \"codicilDate\": \"1962-02-02\", \"executorTitle\": \"Mr\", \"lodgementType\": \"safeCustody\", \"deceasedGender\": \"male\", \"applicationType\": \"Personal\", \"deceasedAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"deceasedSurname\": \"E2E_deceased_surname_1633469477956\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor1_surname\", \"numberOfCodicils\": \"3\", \"registryLocation\": \"Liverpool\", \"deceasedForenames\": \"E2E_deceased_forenames_1633469477956\", \"documentsUploaded\": [{\"id\": \"b1181bfb-d0a7-49d8-8301-b06e58eb42c1\", \"value\": {\"Comment\": \"test file to upload\", \"DocumentLink\": {\"document_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d\", \"document_filename\": \"test_file_for_document_upload.png\", \"document_binary_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d/binary\"}, \"DocumentType\": \"email\"}}], \"executorForenames\": \"executor1_forenames\", \"deceasedDateOfBirth\": \"1930-01-01\", \"deceasedDateOfDeath\": \"2017-01-01\", \"deceasedTypeOfDeath\": \"diedOnOrAbout\", \"deceasedEmailAddress\": \"[email protected]\", \"executorEmailAddress\": \"[email protected]\", \"deceasedAnyOtherNames\": \"Yes\", \"additionalExecutorList\": [{\"id\": \"bd8d7ca2-ed84-424c-a241-40fd99b15596\", \"value\": {\"executorTitle\": \"Dr\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor2_surname\", \"executorForenames\": \"executor2_forenames\", \"executorEmailAddress\": \"[email protected]\"}}], \"deceasedFullAliasNameList\": [{\"id\": \"1970ac9d-532e-48f2-8851-04ae6eec973f\", \"value\": {\"FullAliasName\": \"deceased_alias1_1633469477956\"}}, {\"id\": \"86c842aa-e10f-44d5-8c28-a7ce8a1cb0eb\", \"value\": {\"FullAliasName\": \"deceased_alias2\"}}]}",
"ce_description": "upload_document_event_description_text",
"ce_event_id": "uploadDocument",
"ce_event_name": "Upload document",
"ce_id": 30638630,
"ce_security_classification": "PUBLIC",
"ce_state_id": "WillLodgementCreated",
"ce_state_name": "Will lodgement created",
"ce_summary": "upload_document_event_summary_text",
"ce_user_first_name": "Probate",
"ce_user_id": "349978",
"ce_user_last_name": "Backoffice",
"extraction_date": "2021-10-06"
}
正如您可以看到ce_data欄位包含一個陣列。
當我使用 Databricks 在 Apache Spark 中讀取 JSON 時,我得到以下 printSchema()
root
|-- cd_created_date: string (nullable = true)
|-- cd_jurisdiction: string (nullable = true)
|-- cd_last_modified: string (nullable = true)
|-- cd_last_state_modified_date: string (nullable = true)
|-- cd_latest_state: string (nullable = true)
|-- cd_reference: long (nullable = true)
|-- cd_security_classification: string (nullable = true)
|-- cd_version: long (nullable = true)
|-- ce_case_data_id: long (nullable = true)
|-- ce_case_type_id: string (nullable = true)
|-- ce_case_type_version: long (nullable = true)
|-- ce_created_date: string (nullable = true)
**|-- ce_data: string (nullable = true)**
|-- ce_description: string (nullable = true)
|-- ce_event_id: string (nullable = true)
|-- ce_event_name: string (nullable = true)
|-- ce_id: long (nullable = true)
|-- ce_security_classification: string (nullable = true)
|-- ce_state_id: string (nullable = true)
|-- ce_state_name: string (nullable = true)
|-- ce_summary: string (nullable = true)
|-- ce_user_first_name: string (nullable = true)
|-- ce_user_id: string (nullable = true)
|-- ce_user_last_name: string (nullable = true)
|-- extraction_date: string (nullable = true)
正如您從上面 Databricks 中的 PrintSchema 中看到的那樣,欄位 ce_data 沒有顯示為陣列。
但是,我想查詢 ce_data 欄位中的陣列。例如,我想撰寫一個查詢來創建一個 LodgeDate = 1970-03-03?
我的嘗試類似于
test = spark.sql("""select ce_data from testtable where ce_data.lodgeDate = '1970-03-03'""")
在 Databricks 中輸入上述代碼時出現的錯誤是:
Can't extract value from ce_data#12747: need struct type but got string;
因此,我首先需要了解為什么我沒有在 printSchema() 中看到陣列,但是我的主要問題是如何使用 sparkSQL 查詢 JSON 中的陣列。
我還想知道是否需要匯入一些庫?
uj5u.com熱心網友回復:
正如您已經提到的,這ce_data是一個帶有 JSON 內容的字串,假設 JSON 有效,您可以使用get_json_object函式來提取 JSON 的屬性,如下所示
spark.sql("""
select ce_data
from testtable
where get_json_object(ce_data, "$.lodgedDate") = "1970-03-03"
""").show()
但是,如果你問我,我會說我更喜歡 Python 語法而不是 SQL 語法。這樣干凈多了
from pyspark.sql import functions as F
(df
.where(F.get_json_object('ce_data', '$.lodgedDate') == '1970-03-03')
.show()
)
uj5u.com熱心網友回復:
看起來您的欄位是 json 字串而不是陣列。您可以使用“from_json”方法將其轉換為結構型別。然后你可以用上面的代碼查詢它。請注意,您將需要 json 的架構才能使用 from_json 方法。例如
from pyspark.sql import functions as f
# get your json schema
json_string = "{\"willDate\": \"1950-01-01\", \"jointWill\": \"Yes\", \"lodgedDate\": \"1970-03-03\", \"codicilDate\": \"1962-02-02\", \"executorTitle\": \"Mr\", \"lodgementType\": \"safeCustody\", \"deceasedGender\": \"male\", \"applicationType\": \"Personal\", \"deceasedAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"deceasedSurname\": \"E2E_deceased_surname_1633469477956\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor1_surname\", \"numberOfCodicils\": \"3\", \"registryLocation\": \"Liverpool\", \"deceasedForenames\": \"E2E_deceased_forenames_1633469477956\", \"documentsUploaded\": [{\"id\": \"b1181bfb-d0a7-49d8-8301-b06e58eb42c1\", \"value\": {\"Comment\": \"test file to upload\", \"DocumentLink\": {\"document_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d\", \"document_filename\": \"test_file_for_document_upload.png\", \"document_binary_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d/binary\"}, \"DocumentType\": \"email\"}}], \"executorForenames\": \"executor1_forenames\", \"deceasedDateOfBirth\": \"1930-01-01\", \"deceasedDateOfDeath\": \"2017-01-01\", \"deceasedTypeOfDeath\": \"diedOnOrAbout\", \"deceasedEmailAddress\": \"[email protected]\", \"executorEmailAddress\": \"[email protected]\", \"deceasedAnyOtherNames\": \"Yes\", \"additionalExecutorList\": [{\"id\": \"bd8d7ca2-ed84-424c-a241-40fd99b15596\", \"value\": {\"executorTitle\": \"Dr\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor2_surname\", \"executorForenames\": \"executor2_forenames\", \"executorEmailAddress\": \"[email protected]\"}}], \"deceasedFullAliasNameList\": [{\"id\": \"1970ac9d-532e-48f2-8851-04ae6eec973f\", \"value\": {\"FullAliasName\": \"deceased_alias1_1633469477956\"}}, {\"id\": \"86c842aa-e10f-44d5-8c28-a7ce8a1cb0eb\", \"value\": {\"FullAliasName\": \"deceased_alias2\"}}]}"
your_json_schema = f.schema_of_json(json_string)
df = df.withColumn(“ce_data”, f.from_json(df.ce_data,schema=your_json_schema)
df = df.filter("ce_data.lodgeDate = '1970-03-03'")
如果結構中有陣列。您可以使用 f.explode 方法為每個陣列欄位創建一個新行。然后從那里您可以像使用任何列一樣查詢它們
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/311439.html
