FYI, I can't use PySpark!
My data looks like this:
0 { "CountryOfManufacture": "China", "Tags": ["U...
1 { "CountryOfManufacture": "China", "Tags": ["U...
2 { "CountryOfManufacture": "China", "Tags": [] }
3 { "CountryOfManufacture": "Japan", "Tags": ["3...
4 { "CountryOfManufacture": "Japan", "Tags": ["1...
... ...
222 { "CountryOfManufacture": "USA", "ShelfLife": ...
223 { "CountryOfManufacture": "USA", "ShelfLife": ...
224 { "CountryOfManufacture": "USA", "ShelfLife": ...
225 { "CountryOfManufacture": "USA", "ShelfLife": ...
226 { "CountryOfManufacture": "USA", "ShelfLife": .
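So the dictionaries contain different values. I'm only interested in the first one (the country of manufacture), and I'd like to split it out and add it to another DataFrame.
Thanks!
uj5u.com user reply:
If all of your dictionaries have the same keys (or even if they don't! See Pranav's comment below!), then pandas.DataFrame.from_records works well (see its documentation page).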
import pandas as pd
data = [{'CountryOfManufacture': 'China', 'col_2': 'a'},
{'CountryOfManufacture': 'Japan', 'col_2': 'b'},
{'CountryOfManufacture': 'China', 'col_2': 'c'},
{'CountryOfManufacture': 'USA', 'col_2': 'd'}]
df = pd.DataFrame.from_records(data)
print(df.head())
# CountryOfManufacture col_2
# 0 China a
# 1 Japan b
# 2 China c
# 3 USA d
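If you only need one column, you can either select it afterwards with df["CountryOfManufacture"], or use the exclude keyword and pass a list of all the columns you don't need: df = pd.DataFrame.from_records(data, exclude=['col_2'])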
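To illustrate the "even if they don't" case, here is a minimal sketch with records whose keys differ. The ShelfLife value of 12 and the names records, df_mixed and countries are made up for illustration; they are not from the original data.

```python
import pandas as pd

# Two records with different keys. The ShelfLife value is hypothetical,
# since the original data preview truncates it.
records = [
    {"CountryOfManufacture": "China", "Tags": ["USB Powered"]},
    {"CountryOfManufacture": "USA", "ShelfLife": 12},
]

# Keys missing from a record end up as NaN in that row.
df_mixed = pd.DataFrame.from_records(records)

# Keep only the column of interest, as a one-column DataFrame.
countries = df_mixed[["CountryOfManufacture"]]
print(countries)
```

Selecting the column after construction avoids having to list every unwanted column in exclude.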
uj5u.com user reply:
When I try from_records, my result looks like this:
CustomFields
0 { "CountryOfManufacture": "China", "Tags": ["U...
1 { "CountryOfManufacture": "China", "Tags": ["U...
2 { "CountryOfManufacture": "China", "Tags": [] }
3 { "CountryOfManufacture": "Japan", "Tags": ["3...
4 { "CountryOfManufacture": "Japan", "Tags": ["1...
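I think this is because my data is in an unusual format. It originally came in a CSV file, and this is one of its columns. All the other columns are int/float/object, whereas this column already looks like a dictionary when you view it in Excel.
The data in your example is formatted the way I would expect, but this is what mine looks like when I convert it to a list: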
['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": [] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["Comedy"] }', ...
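As you can see, each dictionary is wrapped in an extra pair of quotes, so every element is a string rather than a dict. Here is a single row to illustrate: ['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }'.
Is there a way to solve this without PySpark?
Thanks!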
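One possible approach without PySpark, assuming every cell in the column is a valid JSON string (the sample rows above appear to be): parse each string with the standard-library json.loads, then pull out the key. This is only a sketch; the small DataFrame below stands in for the real CSV, and the names df, parsed, other_df and expanded are hypothetical.

```python
import json

import pandas as pd

# Stand-in for the CSV column; in practice this would come from pd.read_csv.
df = pd.DataFrame({
    "CustomFields": [
        '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }',
        '{ "CountryOfManufacture": "China", "Tags": [] }',
        '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }',
    ]
})

# Turn each JSON string into a real Python dict.
parsed = df["CustomFields"].apply(json.loads)

# Extract only the country; .get returns None if a row lacks the key.
other_df = pd.DataFrame(
    {"CountryOfManufacture": parsed.apply(lambda d: d.get("CountryOfManufacture"))}
)

# Alternatively, expand every key into its own column.
expanded = pd.DataFrame.from_records(parsed.tolist())

print(other_df)
```

If some cells are empty or not valid JSON, json.loads will raise an error, so those rows would need to be filtered out or handled first.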
