我有一個如下所示的資料集。我需要根據 groupby 對狀態關閉的 ticket_id 進行計數,我需要從 ticket_time 列中獲取周開始(星期日)和周末。 輸入資料框
id status ticket_time product country
1260057 open 2021-10-03 01:20:00 Broadband Grenada
2998178 open 2021-10-06 00:00:00 Fixed Voice Bahamas
3762949 closed 2021-10-04 00:00:00 Fixed Voice St Lucia
3766608 closed 2021-10-10 00:00:00 Broadband St Lucia
3767125 closed 2021-10-03 00:00:00 TV Antigua
etc.................
輸出資料幀示例:
country product week_startdate week_enddate Value
antigua TV 2021-10-03 00:00:00 2021-10-10 00:00:00 Count(id) with status=closed
St Lucia Broadband ... .... ...
St Lucia Fixedvoice ... .... ...
對于每個國家/地區和每個產品,我應該獲得處于關閉狀態的 id 計數。groupby 國家和產品是否有效?
輸入資料集
{'ticket_id': [1260057,
2998178,
3762949,
3766608,
3767125,
6050009,
6050608,
6050972,
6052253,
6053697,
6053812,
6053848,
6053970,
6054363,
6054549,
6055529,
6057303,
6057364,
6058428,
6101321,
6103451,
6103688,
6105180,
693170,
1259365,
1259466,
1260843,
3762579,
3762778,
3764140,
3768850,
6050528,
6050595,
6051099,
6053704,
6054013,
6054582,
6055323,
6056096,
6056363,
6057701,
6058251,
6058323,
6102386,
6103121,
6104844,
6104935,
692914,
1260730,
2997275],
'status': ['open',
'open',
'closed',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'open',
'open',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'cancelled',
'open',
'open',
'open',
'closed',
'closed',
'open'],
'ticket_start_time': [Timestamp('2021-10-04 01:20:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00')],
'product_type': ['Broadband',
'Fixed Voice',
'Fixed Voice',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Broadband',
'Fixed Voice',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'Fixed Voice',
'TV',
'TV',
'TV',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Fixed Voice'],
'org_country': ['Grenada',
'Bahamas',
'St Lucia',
'St Lucia',
'Antigua',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Grenada',
'Grenada',
'St Lucia',
'St Lucia',
'St Vincent',
'St Lucia',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Bahamas']}
uj5u.com熱心網友回復:
修改答案
經過多次解釋,似乎 OP 正在尋找作為聚合結果的周開始和結束,而不是作為分組。
因此,這是修改后的答案:
weekday = 6 # Sunday
wo = pd.offsets.Week(weekday=weekday, normalize=True)
t = df['ticket_start_time'] wo
gbcols = ['org_country', 'product_type']
agg_dict = {
'ticket_id': 'count',
'week_startdate': 'first',
'week_enddate': 'last',
}
out = (
df
.assign(week_startdate=t - wo, week_enddate=t)
.groupby(['ticket_id', 'status'] gbcols).agg(agg_dict)
.query("status == 'closed'").groupby(gbcols).agg(agg_dict)
.reset_index()
.rename(columns={'ticket_id': 'Value'})
)
從提供的“輸入資料”中,我們得到:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 2021-10-10
1 Curacao Broadband 1 2021-10-03 2021-10-10
2 Grenada Broadband 1 2021-09-26 2021-10-03
3 Grenada Fixed Voice 2 2021-09-26 2021-10-10
4 Jamaica Broadband 5 2021-09-26 2021-10-10
5 Jamaica TV 1 2021-09-26 2021-10-03
6 St Lucia Broadband 2 2021-09-26 2021-10-10
7 St Lucia Fixed Voice 1 2021-09-26 2021-10-03
8 St Lucia TV 1 2021-09-26 2021-10-03
9 St Vincent Broadband 1 2021-09-26 2021-10-03
10 Trinidad Broadband 2 2021-10-03 2021-10-10
11 Trinidad TV 1 2021-09-26 2021-10-03
解釋
首先,添加/然后減去周偏移量的技巧wo解決了當日期位于偏移量邊界時減去偏移量的問題。例如:
t = pd.to_datetime([
'2021-10-02', '2021-10-03', '2021-10-03 01:00',
'2021-10-04', '2021-10-09 23:59:59', '2021-10-10']
)
>>> pd.DataFrame(dict(t=t, ws=t wo - wo, we=t wo))
t ws we
0 2021-10-02 00:00:00 2021-09-26 2021-10-03
1 2021-10-03 00:00:00 2021-10-03 2021-10-10
2 2021-10-03 01:00:00 2021-10-03 2021-10-10
3 2021-10-04 00:00:00 2021-10-03 2021-10-10
4 2021-10-09 23:59:59 2021-10-03 2021-10-10
5 2021-10-10 00:00:00 2021-10-10 2021-10-17
接下來,我們希望ticket_id狀態所在的計數不同'closed'。這有點棘手。我們用兩個 groupby 來做。第一個是計算每個(ticket, status)(和所有其他分組列)組的計數。同一票證可能有多個實體打開或關閉,但我們想忽略這一點。
tmp = (
df
.assign(week_startdate=t - wo, week_enddate=t)
.groupby(['ticket_id', 'status'] gbcols)
.agg(agg_dict)
)
>>> tmp.head()
ticket_id week_startdate \
ticket_id status org_country product_type
692914 closed Curacao Broadband 1 2021-10-03
693170 open Curacao Broadband 1 2021-10-03
1259365 closed Grenada Fixed Voice 1 2021-09-26
1259466 closed Grenada Broadband 1 2021-09-26
1260057 open Grenada Broadband 1 2021-10-03
week_enddate
ticket_id status org_country product_type
692914 closed Curacao Broadband 2021-10-10
693170 open Curacao Broadband 2021-10-10
1259365 closed Grenada Fixed Voice 2021-10-03
1259466 closed Grenada Broadband 2021-10-03
1260057 open Grenada Broadband 2021-10-10
最后,我們現在再次簡單地聚合,但只對gbcols列進行分組,并且在該計數中狀態為“關閉”,從而為我們提供了上面的完整運算式。
附錄
OP 希望將week_startdate結束日期和結束日期轉換為具有完整H:M:S解析度的字串。為了這:
for k in ['week_startdate', 'week_enddate']:
out[k] = out[k].dt.strftime('%F %T')
現在:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 00:00:00 2021-10-10 00:00:00
1 Curacao Broadband 1 2021-10-03 00:00:00 2021-10-10 00:00:00
2 Grenada Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
3 Grenada Fixed Voice 2 2021-09-26 00:00:00 2021-10-10 00:00:00
4 Jamaica Broadband 5 2021-09-26 00:00:00 2021-10-10 00:00:00
5 Jamaica TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
6 St Lucia Broadband 2 2021-09-26 00:00:00 2021-10-10 00:00:00
7 St Lucia Fixed Voice 1 2021-09-26 00:00:00 2021-10-03 00:00:00
8 St Lucia TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
9 St Vincent Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
10 Trinidad Broadband 2 2021-10-03 00:00:00 2021-10-10 00:00:00
11 Trinidad TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
uj5u.com熱心網友回復:
IUC:
def week_start_day(s):
return s - (pd.to_timedelta((s.dt.weekday 1)%7, unit='D'))
def week_end_day(s):
return s pd.to_timedelta(7-(s.dt.weekday 2)%7, unit='D')
df['ticket_time'] = pd.to_datetime(df['ticket_time'])
df = df[df['status'] == 'closed'].groupby(['country', 'product']).agg(week_startdate = pd.NamedAgg(column='ticket_time', aggfunc=week_start_day),
week_enddate = pd.NamedAgg(column='ticket_time', aggfunc=week_end_day),
Value = pd.NamedAgg(column='id', aggfunc="size"))
df.reset_index(inplace=True)
OUTPUT
country product week_startdate week_enddate Value
0 Antigua TV 2021-10-03 2021-10-09 1
1 St Lucia Broadband 2021-10-10 2021-10-16 1
2 St Lucia Fixed Voice 2021-10-03 2021-10-09 1
Setup
data = {'id': [1260057, 2998178, 3762949, 3766608, 3767125],
'status': ['open', 'open', 'closed', 'closed', 'closed'],
'ticket_time': ['2021-10-03 01:20:00', '2021-10-06 00:00:00', '2021-10-04 00:00:00',
'2021-10-10 00:00:00', '2021-10-03 00:00:00'],
'product': ['Broadband', 'Fixed Voice', 'Fixed Voice', 'Broadband', 'TV'],
'country': ['Grenada', 'Bahamas', 'St Lucia', 'St Lucia', 'Antigua']}
df = pd.DataFrame.from_dict(data)
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/390628.html
標籤:Python 熊猫 数据框 pandas-groupby
上一篇:根據日期在資料框中創建一個新列
下一篇:列舉列中的值
