In a pandas DataFrame, find whether a True value in column A is its first occurrence since the last True in column B

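I am looking for the most efficient way to find whether a True value in column A is the first True to occur since the last True value in column B.

In these examples, the expected output is column C.

Example 1: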

import pandas as pd

df = pd.DataFrame({
    'A': [False, False, True, False, True, False, True, False, True],
    'B': [True, False, False, False, False, True, False, False, False],
    'C': [False, False, True, False, False, False, True, False, False]
})

	A	B	C
0	False	True	False
1	False	False	False
2	True	False	True
3	False	False	False
4	True	False	False
5	False	True	False
6	True	False	True
7	False	False	False
8	True	False	False

Example 2:

df = pd.DataFrame({
    'A': [True, False, False, True, False, True, False, True, False],
    'B': [False, True, False, False, False, False, True, False, False],
    'C': [False, False, False, True, False, False, False, True, False]
})

	A	B	C
0	True	False	False
1	False	True	False
2	False	False	False
3	True	False	True
4	False	False	False
5	True	False	False
6	False	True	False
7	True	False	True
8	False	False	False

Example 3:

Here you can find a .csv file with a larger example

Reply from a uj5u.com user:

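You can group your DataFrame by the cumulative sum of column "B", which splits the rows the way you described: each True in "B" starts a new group. You can then use idxmax to get the index of the first True in column "A" within each group. Once you have those indices, you can create your new column "C".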
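To make the grouping step concrete, here is a minimal sketch of the group labels that the cumulative sum produces for column "B" of both examples. The variable names are illustrative only.

```python
import pandas as pd

# Column "B" from example 1 and example 2
b1 = pd.Series([True, False, False, False, False, True, False, False, False])
b2 = pd.Series([False, True, False, False, False, False, True, False, False])

# Each True in "B" increments the running total, so the cumulative sum
# acts as a group label: rows share a label until the next True in "B".
labels1 = b1.cumsum().tolist()
labels2 = b2.cumsum().tolist()

print(labels1)  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
print(labels2)  # [0, 1, 1, 1, 1, 1, 2, 2, 2]
```

Label 0 only appears when "B" does not start with True; it marks the rows that come before the first True in "B".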

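Using idxmax is a small trick: we are not actually interested in the maximum value, since column "A" contains only True and False. idxmax returns the index of the first occurrence of the maximum (here, the first True in each group), which is exactly what we are after. If a group contains no True at all, idxmax returns the index of its first False instead, so those groups have to be filtered out.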
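The behaviour of idxmax on boolean data can be checked on two small Series. This is only an illustration with made-up index labels, not part of the original answer.

```python
import pandas as pd

# A group that contains True: idxmax returns the label of the first True.
with_true = pd.Series([False, True, True], index=[10, 11, 12])
first = with_true.idxmax()
print(first)  # 11

# A group with no True: the maximum is False, so idxmax returns the label
# of the first row, which is not a real match and must be discarded.
without_true = pd.Series([False, False], index=[20, 21])
misleading = without_true.idxmax()
print(misleading)  # 20
```

This is why the code in the answer also aggregates "max": it tells us which groups actually contain a True.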

import pandas as pd

df = pd.DataFrame({
    'A': [False, False, True, False, True, False, True, False, True],
    'B': [True, False, False, False, False, True, False, False, False],
})

# get a dataframe of the position of the max as well as the max value
indices_df = df["A"].groupby(df["B"].cumsum()).agg(["idxmax", "max"])

# mask to filter out the 0th group
skip_0th = (indices_df.index > 0)

# mask to filter out groups who do not have True as a value
groups_with_true = (indices_df["max"] == True)

# combine masks and retrieve the appropriate index
indices = indices_df.loc[skip_0th & groups_with_true, "idxmax"]

df["C"] = False
df.loc[indices, "C"] = True

print(df)
       A      B      C
0  False   True  False
1  False  False  False
2   True  False   True
3  False  False  False
4   True  False  False
5  False   True  False
6   True  False   True
7  False  False  False
8   True  False  False

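Update for example 2.

In example 2, column "A" has a True before the first True in column "B", and that row must not be flagged. We can handle this by slicing the indices Series to exclude any entry whose index label is 0 (for example, a label slice from 1 to the end); this has the same effect as the skip_0th mask in the code above. This works because our groupby operation labels the groups with the values of .cumsum. In example 1, the smallest index label is 1 (because the first value in column "B" is True), whereas in example 2 the smallest index label is 0. Since we do not want group 0 to affect our result, we can simply slice it off our indices. The indices printed below are computed with the groups_with_true mask only.

When we assign "C" after slicing our indices Series, we correctly ignore all values that come before the first True in column "B".

Enough text, let's look at some code.

Example 1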

print(indices)
1    2
2    6

# Slicing here doesn't change anything, since indices does not have
#  a value corresponding to label position 0
indices = indices.loc[1:]
print(indices)
1    2
2    6

Example 2

print(indices)
0    0
1    3
2    7

# we don't want to include the value from label position 0 in `indices`
#  so we can use slicing to remove it

indices = indices.loc[1:]
print(indices)
1    3
2    7
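Putting the steps above together, the approach might be wrapped in a function along the following lines. This is a sketch rather than the answerer's own code: the name first_true_since and its signature are hypothetical, and it assumes both columns are boolean Series that share the same index.

```python
import pandas as pd

def first_true_since(a: pd.Series, b: pd.Series) -> pd.Series:
    """Hypothetical helper: flag the first True in `a` after each True in `b`."""
    # Per group: label of the first maximum, plus the maximum itself
    stats = a.groupby(b.cumsum()).agg(["idxmax", "max"])
    # Drop group 0 (rows before the first True in `b`) and groups with no True
    keep = (stats.index > 0) & stats["max"].astype(bool)
    result = pd.Series(False, index=a.index)
    result.loc[stats.loc[keep, "idxmax"]] = True
    return result

# Example 1 from the question
a1 = pd.Series([False, False, True, False, True, False, True, False, True])
b1 = pd.Series([True, False, False, False, False, True, False, False, False])
c1 = first_true_since(a1, b1).tolist()

# Example 2 from the question
a2 = pd.Series([True, False, False, True, False, True, False, True, False])
b2 = pd.Series([False, True, False, False, False, False, True, False, False])
c2 = first_true_since(a2, b2).tolist()
```

With a DataFrame this would be used as df["C"] = first_true_since(df["A"], df["B"]).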

Reply from a uj5u.com user:

Here is one way to do it, though perhaps not the best.

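# Module-level flag: True after a True in column B has been seen and until
# the next True in column A consumes it. Reset it to False before reusing
# the function on another DataFrame.
is_occurred = False
def is_first_occurrence_since(column_to_check, column_occurence):
    global is_occurred
    if is_occurred and column_to_check == True:
        is_occurred = False
        return True
    elif not is_occurred and column_occurence == True:
        is_occurred = True
    return False
df['C'] = df.apply(lambda row: is_first_occurrence_since(row['A'], row['B']), axis=1)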
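The same row-by-row logic can also be sketched without a global variable, so that repeated calls do not share state. The function name first_true_since_loop and the local variable armed are hypothetical, and the function returns a plain list rather than a Series.

```python
def first_true_since_loop(a, b):
    """Hypothetical helper: same state machine as above, with local state."""
    armed = False  # True once a True in `b` was seen and not yet consumed
    flags = []
    for a_val, b_val in zip(a, b):
        if armed and a_val:
            # First True in `a` since the last True in `b`
            armed = False
            flags.append(True)
            continue
        if b_val:
            armed = True
        flags.append(False)
    return flags

# Example 1 from the question
c1 = first_true_since_loop(
    [False, False, True, False, True, False, True, False, True],
    [True, False, False, False, False, True, False, False, False],
)

# Example 2 from the question
c2 = first_true_since_loop(
    [True, False, False, True, False, True, False, True, False],
    [False, True, False, False, False, False, True, False, False],
)
```

With a DataFrame this would be used as df["C"] = first_true_since_loop(df["A"], df["B"]). It still iterates in Python, so it is unlikely to be the fastest option on large data.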

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/371355.html

Tags: Python, pandas, dataframe, vectorization
