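I receive catalog files listing shoes and their brands. For each shoe I need to know when it was first seen, and when it was last seen relative to its brand. The last-seen time should be the date on which we received a file and the shoe was no longer in it.
There are three columns:
- model_id: the model of the shoe
- brand: the brand of the shoe
- release_date: the date the catalog file was released.
If a shoe of some brand is seen on release_date 1 and 2 but no longer appears on 3, then min_date will be 1 and max_date will be 3.
If the shoe appears on the latest release_date available for its brand, min_date is still the minimum, but max_date will be NaT.
min_date should never be greater than max_date.
Given the input: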
data = pd.DataFrame({'model_id':[2,1,2,3],'brand':['nike','nike','nike','adidas'],'release_date':['01/03/2021','01/02/2021','01/01/2021','01/04/2021']}).sort_values('model_id')
| | model_id | brand | release_date |
|---:|-----------:|:--------|:---------------|
| 1 | 1 | nike | 01/02/2021 |
| 0 | 2 | nike | 01/03/2021 |
| 2 | 2 | nike | 01/01/2021 |
| 3 | 3 | adidas | 01/04/2021 |
the result should be:
| | model_id | brand | min_date | max_date |
|---:|-----------:|:--------|:---------------|:---------------|
| 1 | 1 | nike | 01/02/2021 | 01/03/2021 |
| 0 | 2 | nike | 01/01/2021 | NaT |
| 3 | 3 | adidas | 01/04/2021 | NaT |
The first row has min_date as the earliest date on which the model appeared in a catalog, and its max_date is the max(release_date) of the same brand.
The second row has min_date as its min(release_date), but max_date is NaT because the model itself appears on the max(release_date) for its brand; for it to have a date, there would have to be a later max(release_date) for the same brand.
The same goes for the third row.
The logic I am having a hard time implementing is max_date, which has to be the max(release_date) of each brand unless that model_id appears on that very same release_date, in which case it should be a missing value.
I tried everything in this question without any success, because I did not explain myself well enough:
Pandas group by two fields, pick min date and next max date from other group
Reply from a uj5u.com user:
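This does what you want. First take the latest release_date of each brand from the full data, then group by model_id and brand to get each model's first and last date. Finally, map each brand's latest date onto its models and keep it only where the model's own last date is earlier; otherwise max_date is NaT. The brand maximum has to be computed before the rows are collapsed to one per model, or it would be the maximum of the first dates only.
import pandas as pd

shoes = pd.DataFrame(
{
"model_id": [2, 1, 2, 3],
"brand": ["nike", "nike", "nike", "adidas"],
"release_date": ["01/03/2021", "01/02/2021", "01/01/2021", "01/04/2021"],
}
)
shoes["release_date"] = pd.to_datetime(shoes["release_date"])
# latest catalog date per brand, taken from the full data
brand_max = shoes.groupby("brand")["release_date"].max()
# first and last date on which each model was seen
result = (shoes
.groupby(["model_id", "brand"])["release_date"]
.agg(min_date="min", last_seen="max")
.reset_index())
# find the "max_date" for each brand
max_dates_col = result["brand"].map(brand_max)
# keep it only for models that are missing from the brand's latest catalog
result["max_date"] = max_dates_col.where(
result["last_seen"] < max_dates_col, other=pd.NaT
)
result = result.drop(columns="last_seen")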
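The rule from the question (max_date is the brand's latest release_date unless the model itself appears on that date) can also be sketched with groupby().transform, which broadcasts the brand maximum back onto every row before the rows are collapsed. This is only a sketch under assumptions: the function name first_and_last_seen and the helper columns brand_latest and last_seen are made up for illustration, and the dates are assumed to be month-first (%m/%d/%Y), which the question does not state.

```python
import pandas as pd


def first_and_last_seen(catalog: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (model_id, brand) with min_date and max_date."""
    df = catalog.copy()
    # Assumption: dates are month-first; adjust the format if they are not.
    df["release_date"] = pd.to_datetime(df["release_date"], format="%m/%d/%Y")
    # Latest catalog date of the brand, broadcast to every row of that brand.
    df["brand_latest"] = df.groupby("brand")["release_date"].transform("max")
    out = df.groupby(["model_id", "brand"], as_index=False).agg(
        min_date=("release_date", "min"),
        last_seen=("release_date", "max"),
        brand_latest=("brand_latest", "first"),
    )
    # Keep the brand's latest date only for models missing from that catalog.
    out["max_date"] = out["brand_latest"].where(
        out["last_seen"] < out["brand_latest"]
    )
    return out[["model_id", "brand", "min_date", "max_date"]]


data = pd.DataFrame(
    {
        "model_id": [2, 1, 2, 3],
        "brand": ["nike", "nike", "nike", "adidas"],
        "release_date": ["01/03/2021", "01/02/2021", "01/01/2021", "01/04/2021"],
    }
)
result = first_and_last_seen(data)

# Invariant from the question: min_date never exceeds max_date when both exist.
closed = result.dropna(subset=["max_date"])
assert (closed["min_date"] <= closed["max_date"]).all()
print(result)
```

Computing the brand maximum on the uncollapsed rows is the important design choice here: once each model is reduced to its first date, the brand's true latest catalog date may no longer be present.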
