I have a df with the following structure:
  vid sid   pid        url
1   A  A1  page     ABCDEF
2   A  A1  page     DEF123
3   A  A1  page     GHI345
4   A  A1  page     JKL345
5   B  B1  page  AB12345EF
6   B  B2  page        IJK
7   B  B2  page        XYZ
8   C  C1  page      ABCEF
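import pandas as pd

# Sample data matching the table above (the keys 1-8 become the index labels)
data = {'vid':{1:'A',2:'A',3:'A',4:'A',5:'B',6:'B',7:'B',8:'C'},
'sid':{1:'A1',2:'A1',3:'A1',4:'A1',5:'B1',6:'B2',7:'B2',8:'C1'},
'pid':{1:'page',2:'page',3:'page',4:'page',5:'page',6:'page',7:'page',8:'page'},
'url':{1:'ABCDEF',2:'DEF123',3:'GHI345',4:'JKL345',5:'AB12345EF',6:'IJK',7:'XYZ',8:'ABCEF'}
}
df = pd.DataFrame(data)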
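I also have a list of substrings:
lst = ['AB','EF']
Essentially, I want to group by sid and check url. If every element of the list occurs as a substring within at least one single row of the group, the sid should be kept. If not, the sid should be filtered out of the df. The substrings do not have to appear consecutively within url.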
Pseudocode:
group by sid
    if any row in url contains all the substrings in lst
        keep the sid
    if no row in url contains all the substrings in lst
        remove the sid from the df
Result of applying the logic above to df, using lst:
  vid sid   pid        url
1   A  A1  page     ABCDEF
2   A  A1  page     DEF123
3   A  A1  page     GHI345
4   A  A1  page     JKL345
5   B  B1  page  AB12345EF
8   C  C1  page      ABCEF
A helpful uj5u.com user replied:
Use boolean indexing:
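import pandas as pd

# For every row, attach the list of all url values in that row's sid group
gb_df = df.groupby('sid')['url'].transform(lambda x : [x.tolist()]*len(x))
# Keep a row if at least one url in its group contains every substring in lst
indexing = gb_df.apply(lambda li: any(all(el in text for el in lst) for text in li))
output = df[indexing]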
Output:
  vid sid   pid        url
1   A  A1  page     ABCDEF
2   A  A1  page     DEF123
3   A  A1  page     GHI345
4   A  A1  page     JKL345
5   B  B1  page  AB12345EF
8   C  C1  page      ABCEF
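The same selection can also be sketched with `groupby(...).filter(...)`, which keeps or drops whole groups based on a per-group predicate. This is only a minimal sketch: the DataFrame is rebuilt from the question's table, and the helper name `has_row_with_all` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the table in the question (index labels 1-8)
df = pd.DataFrame(
    {
        "vid": ["A", "A", "A", "A", "B", "B", "B", "C"],
        "sid": ["A1", "A1", "A1", "A1", "B1", "B2", "B2", "C1"],
        "pid": ["page"] * 8,
        "url": ["ABCDEF", "DEF123", "GHI345", "JKL345",
                "AB12345EF", "IJK", "XYZ", "ABCEF"],
    },
    index=range(1, 9),
)
lst = ["AB", "EF"]


def has_row_with_all(group: pd.DataFrame) -> bool:
    """Hypothetical helper: True if some url in the group contains every substring in lst."""
    return any(all(sub in url for sub in lst) for url in group["url"])


# filter() keeps every row of the groups for which the predicate returns True
result = df.groupby("sid").filter(has_row_with_all)
print(result)
```

Compared with building a row-level mask, this form reads closest to the pseudocode in the question, at the cost of calling a Python function once per group.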
A helpful uj5u.com user replied:
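Build a boolean mask of the url values that contain every substring in lst:
mask = df.url.apply(lambda u: all(s in u for s in lst))
# Group the mask by `sid` and keep the rows of every group with at least one match:
df.loc[mask.groupby(df.sid).transform('any')]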
  vid sid   pid        url
1   A  A1  page     ABCDEF
2   A  A1  page     DEF123
3   A  A1  page     GHI345
4   A  A1  page     JKL345
5   B  B1  page  AB12345EF
8   C  C1  page      ABCEF
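The question asks for all substrings to occur in the same row, which is stricter than matching any one of them. The two conditions give identical results on the sample data, so the following sketch uses hypothetical data (not from the question) in which they differ: in group `X1` the substrings only ever appear in different rows.

```python
import pandas as pd

# Hypothetical data: X1 has 'AB' and 'EF' only in separate rows,
# X2 has both substrings in a single row.
df = pd.DataFrame({
    "sid": ["X1", "X1", "X2"],
    "url": ["AB111", "222EF", "AB333EF"],
})
lst = ["AB", "EF"]

# One boolean column per substring; regex=False treats each one as a literal
per_substring = pd.concat(
    [df["url"].str.contains(sub, regex=False) for sub in lst],
    axis=1,
    keys=lst,
)

any_mask = per_substring.any(axis=1)  # row contains at least one substring
all_mask = per_substring.all(axis=1)  # row contains every substring

# Broadcast "does this group have a matching row?" back onto each row
kept_any = df[any_mask.groupby(df["sid"]).transform("any")]
kept_all = df[all_mask.groupby(df["sid"]).transform("any")]

print(kept_any["sid"].unique())
print(kept_all["sid"].unique())
```

Only the `all` variant drops `X1`, which is the behaviour the pseudocode in the question describes.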
Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/369904.html
