我有調查資料,我正試圖將其整理成更可用于分析的資料。
我已經能夠找到大多數問題的體面指南,但有一個 - 解決傳遞到單個列中的多項選擇題。
將其拆分為單獨的列的最佳做法是什么?
我通常會在 SQL 中執行此操作,并且我會在檢查每個不同選項時創建一個 CASE,創建與選項一樣多的列。
這是我正在使用的示例表:
---- ----------------------------
| ID | When did you attend? |
---- ----------------------------
| 1 | Monday\nTuesday |
---- ----------------------------
| 2 | Monday\nTuesday\nWednesday |
---- ----------------------------
| 3 | Monday\nFriday\nSaturday |
---- ----------------------------
這就是我認為轉換成的有意義的:
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| ID | When did you attend? | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 1 | Monday\nTuesday | True | True | False | False | False | False | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 2 | Monday\nTuesday\nWednesday | True | True | True | False | False | False | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 3 | Monday\nFriday\nSaturday | True | False | False | False | True | True | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
到目前為止,我已經弄清楚如何使用 將第二列中的字串部分傳遞到串列中df.iloc[:,1].str.split('\n'),但我還沒有弄清楚接下來的步驟(即使那是必要的)。
對我來說,這似乎可以更輕松地根據參加的天數對其余資料進行細分。
uj5u.com熱心網友回復:
我們可以使用str.get_dummies分隔符周圍的字串拆分\n并創建指標變數的資料框,然后reindex沿列軸將此資料框更改dtype為布林值,最后join將給定的資料框與指標資料框一起使用以獲得結果
import calendar
df.join(
df['When did you attend?'].str.get_dummies(sep='\n')
.reindex(calendar.day_name, axis=1, fill_value=0).astype(bool)
)
ID When did you attend? Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 1 Monday\nTuesday True True False False False False False
1 2 Monday\nTuesday\nWednesday True True True False False False False
2 3 Monday\nFriday\nSaturday True False False False True True False
uj5u.com熱心網友回復:
將字串轉換為串列后,您可以展開串列列以獲取每天的單獨行。然后,您可以旋轉表格以使其成為每一天的一列。
使用此資料框:
df = pd.DataFrame({
'ID': [1,2,3],
'When did you attend?': ['Monday\nTuesday', 'Monday\nTuesday\nWednesday', 'Monday\nFriday\nSaturday']
})
拆分字串:
df['When did you attend?'] = df['When did you attend?'].str.split('\n')
ID When did you attend?
0 1 [Monday, Tuesday]
1 2 [Monday, Tuesday, Wednesday]
2 3 [Monday, Friday, Saturday]
爆炸,為每個 ID 創建一行,出席天數:
df = df.explode('When did you attend?')
ID When did you attend?
0 1 Monday
0 1 Tuesday
1 2 Monday
1 2 Tuesday
1 2 Wednesday
...
樞軸將天作為列:
df = df.pivot(index='ID', columns='When did you attend?', values='When did you attend?')
When did you attend? Friday Monday Saturday Tuesday Wednesday
ID
1 NaN Monday NaN Tuesday NaN
2 NaN Monday NaN Tuesday Wednesday
3 Friday Monday Saturday NaN NaN
最后,重新排序列并將值轉換為布林值:
df = df[['Monday', 'Tuesday', 'Wednesday', 'Friday', 'Saturday']]
df = df.fillna(0).astype('bool')
When did you attend? Monday Tuesday Wednesday Friday Saturday
ID
1 True True False False False
2 True True True False False
3 True False False True True
取決于您以后要做什么,僅爆炸而不旋轉可能是更有用的結構。
請注意,列中缺少不在資料中的天數。您可能希望為缺失的天數添加列,除非您的資料無論如何都包含每一天。
uj5u.com熱心網友回復:
df.join(df["When did you attend?"].str.split("\n").explode().rename("Day")).assign(
exist=True
).pivot(index=["ID", "When did you attend?"], columns="Day").droplevel(0, 1).reindex(
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
axis=1,
).fillna(
False
).reset_index().rename_axis(
columns=None
)
分解它:
- 拆分后爆炸該列
tmp = df["When did you attend?"].str.split("\n").explode().rename("Day")
tmp
>>>
0 Monday
0 Tuesday
1 Monday
1 Tuesday
1 Wednesday
2 Monday
2 Friday
2 Saturday
Name: Day, dtype: object
- 連接回原始 df 并創建 True 的存在列(稍后用作樞軸中的值列)
df.join(tmp).assign(exist=True)
ID When did you attend? Day exist
0 1 Monday\nTuesday Monday True
0 1 Monday\nTuesday Tuesday True
1 2 Monday\nTuesday\nWednesday Monday True
1 2 Monday\nTuesday\nWednesday Tuesday True
1 2 Monday\nTuesday\nWednesday Wednesday True
2 3 Monday\nFriday\nSaturday Monday True
2 3 Monday\nFriday\nSaturday Friday True
2 3 Monday\nFriday\nSaturday Saturday True
- 透視 清理(降低列級別,重新索引一周中的所有天數以解決資料集中缺失的天數,然后用 False 填充)
ID When did you attend? Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 1 Monday\nTuesday True True False False False False False
1 2 Monday\nTuesday\nWednesday True True True False False False False
2 3 Monday\nFriday\nSaturday True False False False True True False
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/362560.html
上一篇:標記組中最大行的列
下一篇:用檔案標記分析情緒
