Create a new column based on the values in a list in another column

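This problem looks simple, but I've had a lot of trouble with it and haven't seen it covered anywhere. I have a column where each row contains a different list, and all I want to do is create a new column based on whether specific values are in that list. The data looks like this: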

Col1
[5,6,23,7,20,21]    
[0,7,20,21]
[3,4,5,23,7,20,21]
[2,3,23,7,20,21]
[3,4,5,23,7,20,21]

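Each number corresponds to a specific value, so 0 = 'apple', 2 = 'grape', and so on...

Although each list contains multiple values, I'm really only looking for certain ones, specifically 0, 2, 4, 6, 16, 17

So what I want to do is add a new column whose value corresponds to the value found in Col1.

This is what the solution should look like: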

Col1               Col2
[5,6,23,7,20,21]   Pear
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Watermelon
[2,3,23,7,20,21]   Grape
[16,20,21]         Pineapple

What I tried:

df['Col2'] = np.where(0 in df['Col1'], 'Apple',
                np.where(2 in df['Col1'], 'Grape', 
                   np.where(4 in df['Col1'], 'Watermelon', )

And so on... but this sets every value to Apple

Col1               Col2
[5,6,23,7,20,21]   Apple
[0,7,20,21]        Apple
[3,4,5,23,7,20,21] Apple
[2,3,23,7,20,21]   Apple
[16,20,21]         Apple

By putting the above inside a for loop I was able to get it to work, but I ran into problems. Code:

df['Col2'] = ''
for i in range(0,df.shape[0]):
   df['Col2'][i] = np.where(0 in df['Col1'][i], 'Apple',
                   np.where(2 in df['Col1'][i], 'Grape', 
                      np.where(4 in df['Col1'][i], 'Watermelon', )

I got the result I was looking for, but I ran into a warning:

<ipython-input-638-5dfd74b69688>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

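I think the warning is because I created the blank column first, but the only reason I did that is that I get an error if I don't create it. Also, when I try a simple df['Col2'].value_counts(), I get an error: TypeError: unhashable type: 'numpy.ndarray'. Oddly, value_counts() still displays results even though I get this error.

I'm not entirely sure how to proceed. I've tried many other ways to create this column, but none of them worked. Any suggestions are appreciated!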
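
A likely explanation for both symptoms, sketched on a three-row sample of the data above: `in` applied to a pandas Series tests the index labels rather than the stored lists, and `np.where` with a scalar condition returns a 0-dimensional array rather than a plain string. The `'Other'` fallback and the `has_zero` name are assumptions added for illustration only.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})

# `in` on a Series checks the index labels (0, 1, 2 here), not the lists.
print(0 in df['Col1'])    # True, because 0 is an index label
print(23 in df['Col1'])   # False, although 23 appears in every list

# With a scalar condition, np.where returns a 0-d array, which is not hashable.
cell = np.where(0 in df['Col1'][1], 'Apple', 'Other')
print(type(cell), cell.ndim)

# A per-row membership test has to look inside each list explicitly.
has_zero = df['Col1'].apply(lambda lst: 0 in lst)
print(has_zero.tolist())

Because the first attempt evaluates `0 in df['Col1']` once for the whole column, the single result 'Apple' is broadcast to every row. Storing 0-d arrays cell by cell in the loop version would also account for the unhashable type error from value_counts().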

uj5u.com user reply:

Use explode:

d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}
df['Col2'] = df['Col1'].explode().map(d).dropna().groupby(level=0).apply(', '.join)
print(df)

# Output:
                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]      Banana
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
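
To see what each step of that chain does, the one-liner can be split into intermediate variables. This is only a sketch: the variable names are made up for illustration, and the sixth row `[7, 20, 21]` is a hypothetical addition to show what happens when nothing matches.

import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [7, 20, 21]]   # hypothetical row with no number of interest
})
d = {0: 'Apple', 2: 'Grape', 4: 'Watermelon', 6: 'Banana', 16: 'Pear', 17: 'Orange'}

exploded = df['Col1'].explode()   # one row per list element, original index repeated
mapped = exploded.map(d)          # numbers missing from d become NaN
matched = mapped.dropna()         # keep only the numbers that had a mapping
joined = matched.groupby(level=0).apply(', '.join)   # back to one string per original row

# Assignment aligns on the index, so rows without any match end up as NaN.
df['Col2'] = joined
print(df)

Note that the grouping relies on the index labels being unique per row, since groupby(level=0) merges all elements that share a label.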

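uj5u.com user reply:

Iterate over the list values and map them to the right fruit, ignoring the numbers you don't need. If there is no match, set the value to NaN. Using str.join allows for the possibility of multiple matches.

To apply this logic row by row, use Series.apply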

import numpy as np

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

df['Col2'] = df['Col1'].apply(lambda lst: ', '.join(mapping[n] for n in lst if n in mapping) or np.nan)

Output:

>>> df

                       Col1        Col2
0     [5, 6, 23, 7, 20, 21]         NaN
1            [0, 7, 20, 21]       Apple
2  [3, 4, 5, 23, 7, 20, 21]  Watermelon
3     [2, 3, 23, 7, 20, 21]       Grape
4  [3, 4, 5, 23, 7, 20, 21]  Watermelon
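
If only a single fruit per row is wanted, as the nested np.where in the question suggests, a variant could return the first match instead of joining all of them. This is a sketch under that assumption: `first_fruit` is a hypothetical helper, and the last row `[2, 0, 7]` is invented to show the priority order.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [2, 0, 7]]   # hypothetical row containing two numbers of interest
})
mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def first_fruit(lst):
    # Check the mapping keys in insertion order (0, then 2, then 4),
    # mirroring the priority of the nested np.where calls; NaN if none match.
    return next((fruit for number, fruit in mapping.items() if number in lst), np.nan)

df['Col2'] = df['Col1'].apply(first_fruit)
print(df)

Here the last row yields Apple rather than Grape, because 0 is checked before 2 regardless of the order inside the list.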

Performance

Note that this should be faster than Corralien's solution.

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

def number_to_fruit(lst):
    return ', '.join(mapping[n] for n in lst if n in mapping) or np.nan

# Simulate a large DataFrame
n = 20000
# ignore_index=True gives every row a unique label, so groupby(level=0) keeps rows separate
df = pd.concat([df]*n, ignore_index=True)

>>> df.shape

(100000, 1)

Timings:

# Using apply. (I've added dropna for a more fair comparison)
>>> %timeit -n 10 df['Col1'].apply(number_to_fruit).dropna()

116 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Corralien's solution 
>>> %timeit -n 10 df['Col1'].explode().map(mapping).dropna().groupby(level=0).apply(', '.join)

710 ms ± 71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
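
The %timeit magic only works in IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard timeit module. The names `base`, `big`, `with_apply` and `with_explode` are made up here, and `n` is kept small so the script runs quickly; absolute numbers will differ from the ones above and depend on the machine and pandas version.

import timeit

import numpy as np
import pandas as pd

mapping = {0: 'Apple', 2: 'Grape', 4: 'Watermelon'}

base = pd.DataFrame({
    'Col1': [[5, 6, 23, 7, 20, 21],
             [0, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21],
             [2, 3, 23, 7, 20, 21],
             [3, 4, 5, 23, 7, 20, 21]]
})
n = 200   # hypothetical size, raise it for a more meaningful comparison
big = pd.concat([base] * n, ignore_index=True)

def number_to_fruit(lst):
    return ', '.join(mapping[x] for x in lst if x in mapping) or np.nan

def with_apply():
    return big['Col1'].apply(number_to_fruit).dropna()

def with_explode():
    return big['Col1'].explode().map(mapping).dropna().groupby(level=0).apply(', '.join)

# Both approaches should agree before their timings are compared.
same_values = with_apply().tolist() == with_explode().tolist()
same_index = with_apply().index.tolist() == with_explode().index.tolist()

t_apply = timeit.timeit(with_apply, number=10)
t_explode = timeit.timeit(with_explode, number=10)
print(f'apply:   {t_apply:.4f} s for 10 runs')
print(f'explode: {t_explode:.4f} s for 10 runs')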
