What is the most efficient way to compute per-row historical values in a Pandas DataFrame?

Suppose I have two Pandas DataFrames (df_a and df_b), where each row represents a toy and its features. Some made-up features:

Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made

Suppose df_a is relatively small (about 100,000 rows) and df_b is relatively large (more than 1 million rows).

Then, for each row in df_a, I want to:

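Find all toys in df_b of the same type as the df_a toy (for example, the same Color group)
The df_b toys must also have been made before the given df_a toy
Then compute the ratio of those that were sold (that is, count of sold / count of all matches)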
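
To make the target quantity concrete, here is a minimal hand-checkable sketch for a single toy. The names toy, df_b_small, mask, matches and ratio are illustrative only, and the tiny data set is made up for this example.

```python
import pandas as pd

# One hypothetical df_a toy (made-up values for illustration)
toy = {"Color": "red", "Date_Made": pd.Timestamp("2020-06-01")}

# A tiny stand-in for df_b
df_b_small = pd.DataFrame({
    "Color": ["red", "red", "blue", "red"],
    "Was_Sold": ["Y", "N", "Y", "Y"],
    "Date_Made": pd.to_datetime(
        ["2020-01-01", "2020-03-01", "2020-02-01", "2020-09-01"]
    ),
})

# Same color AND made strictly before the df_a toy
mask = (df_b_small["Color"] == toy["Color"]) & (
    df_b_small["Date_Made"] < toy["Date_Made"]
)
matches = df_b_small[mask]

# Sold / all matches, falling back to 0 when nothing matches
ratio = (matches["Was_Sold"] == "Y").mean() if len(matches) else 0
print(len(matches), ratio)
```

Two red toys were made before 2020-06-01 and one of them was sold, so the ratio for this toy is 0.5.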

What is the most efficient way to do this per-row calculation?

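The best I have come up with so far is shown below. (Note that the code may contain a bug or two, since I typed it up roughly from a different use case.)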

cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    
    print(col + ' - Started')
    
    # Empty list to build up the calculation in
    ratio_list = []
    
    # Start the iteration
    for row in df_a.itertuples(index=False):
        
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        
        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)

        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
                         
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
        
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')

Using .itertuples() helps, but this is still slow. Is there a more efficient approach, or something I am missing?

Edit: added the following script, which simulates data for the scenario above:

import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000

date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
    'Color': np.random.choice(colors, size_df_a),
    'Size_Group': np.random.choice(sizes, size_df_a),
    'Shape': np.random.choice(shapes, size_df_a),
    'Was_Sold': np.random.choice(sold, size_df_a),
    'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
    )

df_b = pd.DataFrame(
    {
    'Color': np.random.choice(colors, size_df_b),
    'Size_Group': np.random.choice(sizes, size_df_b),
    'Shape': np.random.choice(shapes, size_df_b),
    'Was_Sold': np.random.choice(sold, size_df_b),
    'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
    )

Answer:

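First, I think this would be computed more efficiently with a relational database and an SQL query. Indeed, the filtering can be done by indexing columns, performing a database join, applying some advanced filtering, and counting the result. An optimized relational database can generate an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast set intersection, and so on). Unfortunately, Pandas is not well suited to such efficient advanced queries. Iterating over a Pandas DataFrame is also slow, although I am not sure this can be avoided here using Pandas alone. Fortunately, you can use some Numpy and Python tricks to (partially) implement what a fast relational database engine would do.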
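
As a rough illustration of the database route, the following sketch uses Python's built-in sqlite3 module on a few made-up rows. The table names (toys_a, toys_b), the index name, and the choice to store dates as ISO text and Was_Sold as 0/1 are all assumptions made for this example, and whether a given engine actually uses the index for this join depends on its query planner.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toys_a (id INTEGER PRIMARY KEY, color TEXT, date_made TEXT)")
con.execute("CREATE TABLE toys_b (color TEXT, was_sold INTEGER, date_made TEXT)")
# Composite index so the join can look up (color, date) ranges
con.execute("CREATE INDEX idx_b_color_date ON toys_b (color, date_made)")

# Made-up rows; ISO-formatted dates compare correctly as text
con.executemany("INSERT INTO toys_a VALUES (?, ?, ?)", [
    (0, "red", "2020-06-01"),
    (1, "blue", "2020-01-15"),
    (2, "green", "2020-06-01"),
])
con.executemany("INSERT INTO toys_b VALUES (?, ?, ?)", [
    ("red", 1, "2020-01-01"),
    ("red", 0, "2020-03-01"),
    ("blue", 1, "2020-02-01"),
    ("red", 1, "2020-09-01"),
])

query = """
SELECT a.id, COALESCE(AVG(b.was_sold), 0.0) AS prior_sold_ratio
FROM toys_a AS a
LEFT JOIN toys_b AS b
  ON b.color = a.color AND b.date_made < a.date_made
GROUP BY a.id
ORDER BY a.id
"""
ratios = con.execute(query).fetchall()
print(ratios)
```

The LEFT JOIN keeps toys with no earlier match, and COALESCE turns their empty average into 0, mirroring the prior_count == 0 branch in the question.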

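Additionally, pure-Python object types are slow, especially (Unicode) strings. Converting the columns to efficient types first can therefore save a lot of time (and memory). For example, the Was_Sold column does not need to contain "Y"/"N" string objects: a boolean can be used in this case. So let us convert it: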

df_b.Was_Sold = df_b.Was_Sold == "Y"
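
The memory side of this claim can be checked with a small sketch like the one below. The frame name demo and the column Was_Sold_Bool are hypothetical, and the exact byte counts depend on the pandas version and string storage, so only the comparison is shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({"Was_Sold": rng.choice(["Y", "N"], 100_000)})

# Memory used by the string column (deep=True counts the string objects)
as_str_bytes = demo["Was_Sold"].memory_usage(deep=True)

# Same information stored as one boolean per row
demo["Was_Sold_Bool"] = demo["Was_Sold"] == "Y"
as_bool_bytes = demo["Was_Sold_Bool"].memory_usage(deep=True)

print(as_str_bytes, as_bool_bytes)
```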

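Finally, the current algorithm has poor complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve because of the non-trivial condition. A first step is to group df_b by the col column ahead of time, which avoids an expensive full scan of df_b for every row (previously done with df_b[col] == relevant_val). The dates of each precomputed group can then be sorted so that a fast binary search can be performed later. Finally, Numpy can be used to count the boolean values efficiently (with np.sum).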
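
The binary-search step relies on np.searchsorted returning, for a sorted array, the number of elements strictly smaller than the query when side="left" (the default). A minimal sketch with made-up dates:

```python
import numpy as np
import pandas as pd

# Sorted dates of one hypothetical group (note the duplicate)
dates = pd.to_datetime(
    ["2020-01-01", "2020-03-01", "2020-03-01", "2020-09-01"]
).values
query = np.datetime64("2020-03-01", "ns")

# side="left": count of dates strictly before the query
n_before = np.searchsorted(dates, query, side="left")
# side="right": count of dates before or equal to the query
n_up_to = np.searchsorted(dates, query, side="right")

print(n_before, n_up_to)
```

Because the question requires toys made before the given toy, the default side="left" matches the original Date_Made < created_date condition.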

Note that prior_toys['Was_Sold'] is faster than prior_toys.Was_Sold.

Here is the resulting code:

cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    
    # Empty list to build up the calculation in
    ratio_list = []

    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        
        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]

        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
        
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')

This is 5.5 times faster on my machine.

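Iterating over the Pandas DataFrame is the main source of slowdown. Indeed, prior_toys['Was_Sold'] takes half of the computation time because of the large overhead of Pandas internal function calls repeated Na times. Using Numba may help reduce the cost of the slow iteration. Note that the complexity can be improved (to O(Na log Nb)) by splitting colGroups into subgroups ahead of time. This should remove the cost of computing prior_sold_count entirely. The resulting program should be about 10 times faster than the original.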
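
One possible way to remove the per-row Pandas overhead is sketched below. It is not the subgroup split described above: it swaps in a prefix sum (np.cumsum) of the sold flags per group, so both counts come from a single vectorized np.searchsorted call. The function name prior_sold_ratio is hypothetical, and the sketch assumes Was_Sold holds either booleans or "Y"/"N" strings and that Date_Made is not timezone-aware.

```python
import numpy as np
import pandas as pd

def prior_sold_ratio(df_a, df_b, col):
    """Hypothetical helper: for each df_a row, the share of sold toys among
    df_b rows with the same `col` value made strictly earlier (0 if none)."""
    result = np.zeros(len(df_a))
    a_dates = df_a["Date_Made"].to_numpy()
    for key, group in df_b.groupby(col):
        group = group.sort_values("Date_Made")
        b_dates = group["Date_Made"].to_numpy()
        sold = group["Was_Sold"]
        if not pd.api.types.is_bool_dtype(sold):
            sold = sold == "Y"  # assumes "Y"/"N" strings otherwise
        # sold_prefix[k] = number of sold toys among the k earliest toys
        sold_prefix = np.concatenate(([0], np.cumsum(sold.to_numpy())))
        # Positions of the df_a rows that belong to this group
        rows = np.flatnonzero((df_a[col] == key).to_numpy())
        # Number of group toys made strictly before each df_a toy
        prior_count = np.searchsorted(b_dates, a_dates[rows], side="left")
        prior_sold = sold_prefix[prior_count]
        result[rows] = np.divide(
            prior_sold, prior_count,
            out=np.zeros(len(rows)), where=prior_count > 0,
        )
    return result

# Small made-up check
demo_b = pd.DataFrame({
    "Color": ["red", "red", "blue", "red"],
    "Was_Sold": ["Y", "N", "Y", "Y"],
    "Date_Made": pd.to_datetime(
        ["2020-01-01", "2020-03-01", "2020-02-01", "2020-09-01"]
    ),
})
demo_a = pd.DataFrame({
    "Color": ["red", "blue", "green", "red"],
    "Date_Made": pd.to_datetime(
        ["2020-06-01", "2020-01-15", "2020-06-01", "2021-01-01"]
    ),
})
demo_ratios = prior_sold_ratio(demo_a, demo_b, "Color")
print(demo_ratios)
```

In the original loop this would be used as df_a[col + '_Prior_Sold_Ratio'] = prior_sold_ratio(df_a, df_b, col) for each col. The lookups cost O(Na log Nb) after an O(Nb log Nb) sort per column; actual speedups were not measured here.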
