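My dataframe has several columns, such as ID, organizations, date, and location. I am trying to extract the "organization" values from the "organizations" column. The output I want is the organization names in a new column, separated by commas. For example:
| ID | organizations |
|---|---|
| 1 | [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}] |
| 2 | [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}] |
I would like output similar to this:
| ID | organizations |
|---|---|
| 1 | Glaxosmithkline, Vulpes Fund |
| 2 | Amazon, Sinovac |
I tried the following code (the output is NaN):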
latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']
Edit:
df.head(5)['organizations'].to_dict() gives me the following output:
{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
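Any suggestions would be helpful.
A uj5u.com user replied:
It looks like you have a string. You could use a regex to extract the key/value pairs separated by =, then pivot, as shown below: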
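# Extract every key=value pair, then pivot so that each key becomes a column.
# The space in the first character class keeps leading blanks out of the key names.
(df['organizations'].str.extractall(r'([^{=, ]+)= *([^=,}]+)')
.rename({0:'key', 1:'value'}, axis = 1).reset_index()
.groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())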
key character_offset organization
level_0
0 14199, 1494 Vac, Health
1 700, 1711 Store, Museum
2 8232, 5517 Mart, Rep
3 3881, 5947 Lodge, Hotel
4 3881, 5947 Airport, Landmark
Data:
import pandas as pd

d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
df = pd.Series(d).to_frame('organizations')
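The "you have a string" diagnosis can be checked by looking at the Python type of the cells. The sketch below uses a hypothetical one-row stand-in for the asker's column (the name `sample` is not from the question). With string cells, `.str[0]` indexes characters rather than list items, so it returns the opening bracket instead of the first dictionary, and a chained key lookup never reaches the organization names.

```python
import pandas as pd

# Hypothetical one-row sample mirroring the format shown in the question.
sample = pd.DataFrame({
    'organizations': [
        '[{organization= Vac, character_offset=14199}, '
        '{organization=Health, character_offset=1494}]'
    ]
})

# Each cell is a plain str, not a list of dicts.
cell_types = sample['organizations'].map(type).unique().tolist()
print(cell_types)

# On strings, .str[0] indexes characters, so it yields the opening bracket.
first_chars = sample['organizations'].str[0].tolist()
print(first_chars)
```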
A uj5u.com user replied:
1. Updated to match your latest dataframe edit:
import pandas as pd

data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
'[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
'[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
'[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
'[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data)
df
| Index | Organizations |
|---|---|
| 0 | [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}] |
| 1 | [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}] |
| 2 | [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}] |
| 3 | [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}] |
| 4 | [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}] |
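2. Use a regex with ', '.join() inside .apply() on the column you want:
import re
# Capture the value after the first key= in each {...} group, then join the matches.
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'\{[^=]+=\s*([^=,}]+)', x)))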
df
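3. Result:
| Index | Organizations |
|---|---|
| 0 | Vac, Health |
| 1 | Store, Museum |
| 2 | Mart, Rep |
| 3 | Lodge, Hotel |
| 4 | Airport, Landmark |
Personally, I think you should try to scrape and/or clean the data better before putting it into a dataframe.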
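One way to act on that advice is to parse the raw strings into real dictionaries before building the dataframe. The following is only a sketch: `parse_records`, `raw`, and `clean` are hypothetical names, and it assumes that neither keys nor values contain a comma, an equals sign, or a closing brace.

```python
import re
import pandas as pd

# Hypothetical raw strings in the same format as the question's column.
raw = [
    '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
    '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
]

def parse_records(text):
    """Turn '[{k=v, k=v}, {...}]' into a list of dicts; all values stay strings."""
    records = []
    # Each {...} group becomes one dict.
    for body in re.findall(r'\{([^}]*)\}', text):
        # Assumption: pairs are comma-separated and each contains exactly one '='.
        pairs = (item.split('=', 1) for item in body.split(','))
        records.append({key.strip(): value.strip() for key, value in pairs})
    return records

parsed = [parse_records(text) for text in raw]

# Build the dataframe from already-clean values.
clean = pd.DataFrame({
    'organizations': [', '.join(r['organization'] for r in recs) for recs in parsed]
})
print(clean)
```

Once the cells are real dictionaries, other fields such as character_offset are available in the same way without another regex.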
A uj5u.com user replied:
Is this what you are trying to do?
latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])
A uj5u.com user replied:
You can use a list comprehension together with apply:
import pandas as pd
df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])
Output:
| | newOrg | Organizations |
|---|---|---|
| 0 | [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}] | ['Glaxosmithkline', 'Vulpes Life Sciences Fund'] |
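This approach applies only when each cell already holds a real list of dictionaries rather than a string. Since the question asks for a comma-separated value instead of a list, a small variation could join the names. The sketch below assumes the same kind of sample data as this answer; `df_dicts` is a hypothetical name.

```python
import pandas as pd

# Each cell is a real list of dicts, not a string.
df_dicts = pd.DataFrame({
    'newOrg': [[
        {'organization': 'Glaxosmithkline', 'character_offset': 10512},
        {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845},
    ]]
})

# Join the names into one comma-separated string per row.
df_dicts['Organizations'] = df_dicts['newOrg'].apply(
    lambda records: ', '.join(record['organization'] for record in records)
)
print(df_dicts['Organizations'].tolist())
```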
A uj5u.com user replied:
You can do it like this:
df['organizations'].str.extractall(r"organization= *(\w+)") \
.groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})
Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark
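One caveat with a `\w+` capture group: it stops at the first space, so a multi-word name would be cut short. The sketch below compares it with a character class that reads up to the next comma or closing brace. The row is a hypothetical example in the question's string format, and the second pattern assumes that names contain no commas or braces.

```python
import pandas as pd

# Hypothetical row with a multi-word name, in the question's string format.
s = pd.Series([
    '[{organization=Glaxosmithkline, character_offset=10512}, '
    '{organization=Vulpes Life Sciences Fund, character_offset=13845}]'
], name='organizations')

# \w+ stops at the first space, so the second name is cut to "Vulpes".
short = s.str.extractall(r'organization= *(\w+)')[0].groupby(level=0).agg(', '.join)

# [^,}]+ keeps everything up to the next comma or closing brace.
full = s.str.extractall(r'organization= *([^,}]+)')[0].groupby(level=0).agg(', '.join)

print(short.tolist())
print(full.tolist())
```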
Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/484932.html
