Counting rows in one df based on a column in another df

OK, so I have a first dataframe, df1:

|timestamp                |ip         |
|2022-01-06 11:58:53+00:00|1.1.1.5.   |
|2022-01-08 03:56:35+00:00|10.10.10.24|
|2022-01-09 22:29:30+00:00|3.3.3.89.  |
|2022-03-08 22:37:52+00:00|8.8.8.88.  |

And a second dataframe, df2:

|timestamp                |other|
|2022-01-07 22:08:59+00:00|other|
|2022-01-07 23:08:59+00:00|other|
|2022-01-09 17:04:09+00:00|other|
|2022-03-05 17:04:09+00:00|other|

I want to count how many rows of df2 fall between each pair of consecutive timestamps in df1, which means:

|timestamp                |ip         |count|
|2022-01-06 11:58:53+00:00|1.1.1.5    |NaN  |
|2022-01-08 03:56:35+00:00|10.10.10.24|2    |
|2022-01-09 22:29:30+00:00|3.3.3.89   |1    |
|2022-03-08 22:37:52+00:00|8.8.8.88   |1    |

What I tried was to first create another column in df1 holding the previous timestamp:

df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})

This gives me:

|timestamp                |ip         |timestamp_b4             |
|2022-01-06 11:58:53+00:00|1.1.1.5    |2022-01-06 11:58:53+00:00|
|2022-01-08 03:56:35+00:00|10.10.10.24|2022-01-06 11:58:53+00:00|
|2022-01-09 22:29:30+00:00|3.3.3.89   |2022-01-08 03:56:35+00:00|
|2022-03-08 22:37:52+00:00|8.8.8.88   |2022-01-09 22:29:30+00:00|

and then do something like:

s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())

Unfortunately this does not work, because pandas can only compare identically-labeled objects.

Is there a good pandas/pythonic way to do this?

Thanks
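The comparison error described in the question can be reproduced with a small sketch. The Series names and values below are hypothetical (not the asker's data) and are deliberately of different lengths; the second half shows one possible way around index alignment, by comparing the underlying arrays with NumPy broadcasting.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two Series deliberately have different lengths.
events = pd.Series(pd.to_datetime(["2022-01-07", "2022-01-09", "2022-03-05"]))
edges = pd.Series(pd.to_datetime(["2022-01-06", "2022-01-08"]))

# Series-to-Series comparisons align on the index, so Series whose labels
# differ cannot be compared and pandas raises a ValueError.
try:
    events.between(edges, edges)
    raised = False
except ValueError:
    raised = True

# Comparing the underlying arrays with broadcasting avoids index alignment:
# rows correspond to events, columns to edges.
after_edge = events.to_numpy()[:, None] > edges.to_numpy()[None, :]

# For each edge, the number of events that come strictly after it.
events_after_each_edge = after_edge.sum(axis=0)
print(raised, events_after_each_edge)
```

The broadcast comparison builds a len(events) x len(edges) boolean matrix, so it is only a reasonable choice when both inputs are small.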

A helpful uj5u.com user replied:

Here is one way:

df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
    .assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
    .assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
    .drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})

                   timestamp           ip  count
0  2022-01-06 11:58:53+00:00  1.1.1.5.         0
1  2022-01-08 03:56:35+00:00  10.10.10.24      2
2  2022-01-09 22:29:30+00:00  3.3.3.89.        1
3  2022-03-08 22:37:52+00:00  8.8.8.88.        1

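This approach merges the two frames and sorts by timestamp, then creates a helper column, c2, that tags each df1 row (via its ip value) and backfills that tag onto the df2 rows that precede it. The rows are then grouped by c2 and counted; each group also contains the df1 row itself, hence the -1. In other words, backfilling lets each df1 row act as a grouping key for the df2 timestamps that come before it. Finally, the frame is trimmed back to match the required output.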
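The same "attach each df2 row to the next df1 timestamp, then count" idea can also be sketched with pandas.merge_asof instead of the merge-and-backfill chain. This is an alternative technique, not the answerer's method. It assumes the timestamps are parsed as timezone-aware datetimes, and the names next_ts, matched and result are introduced here only for illustration.

```python
import pandas as pd

# The question's data, with timestamps parsed as timezone-aware datetimes.
df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-06 11:58:53+00:00", "2022-01-08 03:56:35+00:00",
        "2022-01-09 22:29:30+00:00", "2022-03-08 22:37:52+00:00",
    ], utc=True),
    "ip": ["1.1.1.5", "10.10.10.24", "3.3.3.89", "8.8.8.88"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-07 22:08:59+00:00", "2022-01-07 23:08:59+00:00",
        "2022-01-09 17:04:09+00:00", "2022-03-05 17:04:09+00:00",
    ], utc=True),
    "other": ["other"] * 4,
})

# Copy the df1 timestamp into a helper column (hypothetical name: next_ts),
# because merge_asof keeps only the left frame's "timestamp" values.
right = df1[["timestamp"]].assign(next_ts=df1["timestamp"]).sort_values("timestamp")

# direction="forward" pairs each df2 row with the first df1 timestamp
# that is greater than or equal to it. Both frames must be sorted on the key.
matched = pd.merge_asof(
    df2.sort_values("timestamp"), right, on="timestamp", direction="forward"
)

# Count the df2 rows per df1 timestamp and map the counts back onto df1.
counts = matched.groupby("next_ts").size()
result = df1.assign(count=df1["timestamp"].map(counts))
print(result)
```

The first df1 row has no earlier interval, so its count is left as NaN, which matches the output requested in the question.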

A helpful uj5u.com user replied:

Try this; it is an example of what you could do to work toward a solution.

import pandas as pd

table1 = {
    'timestamp': ['2022-01-06 11:58:53+00:00', '2022-01-08 03:56:35+00:00',
                  '2022-01-09 22:29:30+00:00', '2022-03-08 22:37:52+00:00'],
    'ip': ['1.1.1.5.', '10.10.10.24', '3.3.3.89.', '8.8.8.88.']
}
df1 = pd.DataFrame(table1)

table2 = {
    'timestamp': ['2022-01-07 23:08:59+00:00', '2022-01-07 22:08:59+00:00',
                  '2022-03-05 17:04:09+00:00', '2022-01-09 17:04:09+00:00'],
    'other': ['other', 'other', 'other', 'other']
}
df2 = pd.DataFrame(table2)

print('\n\n-------------df1-----------\n\n')
print(df1)
print('\n\n-------------df2-----------\n\n')
print(df2)

def func(line):
    # Timestamps are plain strings here: [:7] is 'YYYY-MM', [8:10] is the day.
    ts = line['timestamp']
    # df1 rows in the same year-month as the current row
    same_month = df1.loc[df1['timestamp'].str.contains(ts[:7], case=False)]
    # Position of the previous row (clamped to 0 for the first row)
    prev = max(line.name - 1, 0)
    try:
        bounds = [same_month['timestamp'].iloc[prev], ts]
    except IndexError:
        # No previous df1 row in this month: fall back to an empty interval
        bounds = [ts, ts]

    # df2 timestamps in the same year-month
    candidates = df2['timestamp'].loc[df2['timestamp'].str.contains(ts[:7], case=False)]

    repetitions = 0
    for x in candidates:
        # Only the day of the month is compared, so intervals that span
        # a month boundary are not counted by this example.
        if int(bounds[0][8:10]) <= int(x[8:10]) <= int(bounds[1][8:10]):
            repetitions += 1
    return repetitions

print('\n\n-------------BREAK-----------\n\n')

df1['count'] = df1.apply(func, axis=1)

print(df1)

A helpful uj5u.com user replied:

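import numpy as np
def time_compare(df1, df2):
    # Count df2 timestamps strictly between df1 rows i-1 and i; for i == 0 the
    # index i-1 wraps to the last row, which gives 0 when df1 is sorted.
    return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values) & (df1['timestamp'].values[i] > df2['timestamp'].values)) for i in range(len(df1.timestamp))]

df1.join(pd.Series(time_compare(df1, df2), name='Count'))

Oddly, I cannot post the dataframe output as usual:

index	timestamp	ip	Count
0	2022-01-06 11:58:53+00:00	1.1.1.5.	0
1	2022-01-08 03:56:35+00:00	10.10.10.24	2
2	2022-01-09 22:29:30+00:00	3.3.3.89.	1
3	2022-03-08 22:37:52+00:00	8.8.8.88.	1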
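The list comprehension above compares every df1 interval against every df2 timestamp. For larger frames, a vectorized variant can be sketched with numpy.searchsorted. This is an assumption-laden alternative rather than the answerer's code: it assumes df1 is already sorted by timestamp, and it treats each interval as closed on the left and open on the right, whereas the answer above uses strict comparisons on both sides.

```python
import numpy as np
import pandas as pd

# The question's data, with timestamps parsed as timezone-aware datetimes.
df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-06 11:58:53+00:00", "2022-01-08 03:56:35+00:00",
        "2022-01-09 22:29:30+00:00", "2022-03-08 22:37:52+00:00",
    ], utc=True),
    "ip": ["1.1.1.5", "10.10.10.24", "3.3.3.89", "8.8.8.88"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-07 22:08:59+00:00", "2022-01-07 23:08:59+00:00",
        "2022-01-09 17:04:09+00:00", "2022-03-05 17:04:09+00:00",
    ], utc=True),
    "other": ["other"] * 4,
})

# Convert to plain datetime64 arrays (values in UTC, timezone info dropped).
edges = df1["timestamp"].to_numpy(dtype="datetime64[ns]")  # assumed sorted
events = np.sort(df2["timestamp"].to_numpy(dtype="datetime64[ns]"))

# For each df1 timestamp, the number of df2 timestamps strictly before it.
before = np.searchsorted(events, edges, side="left")

# The difference between consecutive cumulative counts is the number of
# df2 rows inside each interval [edges[i-1], edges[i]).
counts = np.diff(before)

# The first df1 row has no preceding interval, so it gets NaN.
result = df1.assign(count=np.concatenate([[np.nan], counts]))
print(result)
```

Sorting df2 once and searching it costs roughly O((n + m) log m), compared with the O(n * m) comparisons of the list comprehension.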

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/484943.html

Tags: Python pandas dataframe
