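I'm applying a somewhat complex transformation to my data, and I'd like to know whether anyone has a more efficient way of doing it than mine. I start with this dataframe: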
| | item | value |
|---:|:------------------|--------:|
| 0 | WAUZZZF23MN053792 | 0 |
| 1 | A | 1 |
| 2 | WF0TK3SS2MMA50940 | 0 |
| 3 | A | 10 |
| 4 | B | 11 |
| 5 | C | 12 |
| 6 | D | 13 |
| 7 | E | 14 |
| 8 | W0VEAZKXZMJ857138 | 0 |
| 9 | A | 20 |
| 10 | B | 21 |
| 11 | C | 22 |
| 12 | D | 23 |
| 13 | E | 24 |
| 14 | W0VEAZKXZMJ837930 | 0 |
| 15 | A | 30 |
| 16 | B | 31 |
| 17 | C | 32 |
| 18 | D | 33 |
| 19 | E | 34 |
I want to get here:
| | item | value | C |
|---:|:------------------|--------:|----:|
| 0 | WAUZZZF23MN053792 | 0 | nan |
| 1 | WF0TK3SS2MMA50940 | 0 | 12 |
| 2 | W0VEAZKXZMJ857138 | 0 | 22 |
| 3 | W0VEAZKXZMJ837930 | 0 | 32 |
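That is, for each "long" item, check whether an item "C" follows it and, if so, copy that row's value into the row holding the long item.
My ugly way of doing this is as follows: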
import re
import pandas as pd
df = pd.DataFrame(
    {
        "item": [
            "WAUZZZF23MN053792",
            "A",
            "WF0TK3SS2MMA50940",
            "A",
            "B",
            "C",
            "D",
            "E",
            "W0VEAZKXZMJ857138",
            "A",
            "B",
            "C",
            "D",
            "E",
            "W0VEAZKXZMJ837930",
            "A",
            "B",
            "C",
            "D",
            "E",
        ],
        "value": [
            0,
            1,
            0,
            10,
            11,
            12,
            13,
            14,
            0,
            20,
            21,
            22,
            23,
            24,
            0,
            30,
            31,
            32,
            33,
            33,
        ],
    }
)
def isVIN(x):
    # A "long" item: 17 characters, all upper case, no whitespace and no letter O or I
    return len(x) == 17 and x.upper() == x and re.search(r"\s|O|I", x) is None
# filter the lines with item=="C" or a VIN in item
x = pd.concat([df, df["item"].rename("group").apply(isVIN).cumsum()], axis=1).loc[
    lambda x: (x["item"] == "C") | (x["item"].apply(isVIN))
]
# pivot the lines where item="C"
y = x.loc[x["item"] == "C"].pivot(columns="item").droplevel(level=1, axis=1)
# and then merge the two:
print(
    x.loc[x["item"].apply(isVIN)]
    .merge(y, on="group", how="left")
    .drop("group", axis=1)
    .rename(columns={"value_y": "C", "value_x": "value"})
    .to_markdown()
)
Does anyone know how to make this less ugly?
uj5u.com user reply:
Subjectively less ugly
mask = df.item.str.len().eq(17)
df.set_index(
    [df.item.where(mask).ffill(), 'item']
)[~mask.to_numpy()].value.unstack()['C'].reset_index()
item C
0 W0VEAZKXZMJ837930 32.0
1 W0VEAZKXZMJ857138 22.0
2 WAUZZZF23MN053792 NaN
3 WF0TK3SS2MMA50940 12.0
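More involved, but better
import numpy as np

mask = df.item.str.len().eq(17)
item = df.item.where(mask).ffill()  # long item carried forward (Series.pad is deprecated in recent pandas)
subs = df.item.mask(mask)           # short items; NaN on the long-item rows
valu = df.value
i, r = pd.factorize(item)
j, c = pd.factorize(subs)
a = np.zeros((len(r), len(c)), valu.dtype)  # missing combinations stay 0 rather than NaN
a[i, j] = valu
pd.DataFrame(a, r, c)[['C']].rename_axis('item').reset_index()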
item C
0 WAUZZZF23MN053792 0
1 WF0TK3SS2MMA50940 12
2 W0VEAZKXZMJ857138 22
3 W0VEAZKXZMJ837930 32
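To see what the first one-liner in this answer is doing, here is a minimal step-by-step sketch. It assumes, as the snippet does, that any 17-character item counts as a "long" item, and it uses the sample data from the question's table. The intermediate names (`owner`, `wide`, `result`) and the final `reindex` step are my own additions, not part of the answer; the `reindex` only restores the original row order, since `unstack` sorts the long items alphabetically.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["WAUZZZF23MN053792", "A",
             "WF0TK3SS2MMA50940", "A", "B", "C", "D", "E",
             "W0VEAZKXZMJ857138", "A", "B", "C", "D", "E",
             "W0VEAZKXZMJ837930", "A", "B", "C", "D", "E"],
    "value": [0, 1,
              0, 10, 11, 12, 13, 14,
              0, 20, 21, 22, 23, 24,
              0, 30, 31, 32, 33, 34],
})

# Assumption: a "long" item is any string of exactly 17 characters.
mask = df["item"].str.len().eq(17)

# Carry each long item forward so every row knows which long item it belongs to.
owner = df["item"].where(mask).ffill().rename("owner")

# Keep only the short rows and spread their values into one column per short item.
wide = (
    df.loc[~mask]
    .set_index([owner[~mask], "item"])["value"]
    .unstack()
)

# Put the long items back in their original order, then keep only column "C".
result = (
    wide.reindex(df.loc[mask, "item"].tolist())["C"]
    .rename_axis("item")
    .reset_index()
)
print(result)
```

The first long item has no "C" row, so its `C` entry is NaN, and the column is therefore float rather than int.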
uj5u.com user reply:
Try:
# Your conditions vectorized
m = ((df['item'].str.len() == 17)
     & (df['item'].str.upper() == df['item'])
     & (~df['item'].str.contains(r'\s|O|I')))
# Create virtual groups to align rows
df['grp'] = m.cumsum()
# Merge and align rows (pivot takes keyword arguments only in pandas 2.x)
out = (pd.concat([df[m].set_index('grp'),
                  df[~m].pivot(index='grp', columns='item', values='value')], axis=1)
         .reset_index(drop=True))
Output:
>>> out
item value A B C D E
0 WAUZZZF23MN053792 0 1.0 NaN NaN NaN NaN
1 WF0TK3SS2MMA50940 0 10.0 11.0 12.0 13.0 14.0
2 W0VEAZKXZMJ857138 0 20.0 21.0 22.0 23.0 24.0
3 W0VEAZKXZMJ837930 0 30.0 31.0 32.0 33.0 33.0
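Since this answer swaps the question's row-by-row `isVIN` for a vectorized mask, a quick sanity check that the two agree can be useful. The following is only a sketch on the sample items; `scalar` and `mismatches` are names I made up for illustration.

```python
import re
import pandas as pd

items = pd.Series(
    ["WAUZZZF23MN053792", "A",
     "WF0TK3SS2MMA50940", "A", "B", "C", "D", "E",
     "W0VEAZKXZMJ857138", "A", "B", "C", "D", "E",
     "W0VEAZKXZMJ837930", "A", "B", "C", "D", "E"],
    name="item",
)

def isVIN(x):
    # Scalar check from the question: 17 chars, upper case, no whitespace, no O or I.
    return len(x) == 17 and x.upper() == x and re.search(r"\s|O|I", x) is None

# Vectorized version of the same three conditions.
m = (
    items.str.len().eq(17)
    & items.str.upper().eq(items)
    & ~items.str.contains(r"\s|O|I")
)

# Apply the scalar check to every item and list any rows where the two disagree.
scalar = items.map(isVIN)
mismatches = items[m != scalar]
print("rows where the two checks disagree:", len(mismatches))
```

On this sample the two checks agree on every row, and both flag exactly the four long items.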
uj5u.com user reply:
How about datar, a pandas wrapper that reimagines the pandas API:
Construct the data
>>> import re
>>> from datar.all import (
...     c, f, LETTERS, tibble, first, cumsum,
...     mutate, group_by, slice, pivot_wider, select
... )
>>>
>>> df = tibble(
...     item=c(
...         "WAUZZZF23MN053792",
...         "A",
...         "WF0TK3SS2MMA50940",
...         LETTERS[:5],
...         "W0VEAZKXZMJ857138",
...         LETTERS[:5],
...         "W0VEAZKXZMJ837930",
...         LETTERS[:5],
...     ),
...     value=c(
...         0, 1,
...         0, f[10:15],
...         0, f[20:25],
...         0, f[30:35],
...     )
... )
>>> df
item value
<object> <int64>
0 WAUZZZF23MN053792 0
1 A 1
2 WF0TK3SS2MMA50940 0
3 A 10
4 B 11
5 C 12
6 D 13
7 E 14
8 W0VEAZKXZMJ857138 0
9 A 20
10 B 21
11 C 22
12 D 23
13 E 24
14 W0VEAZKXZMJ837930 0
15 A 30
16 B 31
17 C 32
18 D 33
19 E 34
Manipulate the data
>>> def isVIN(x):
...     return len(x) == 17 and x.isupper() and re.search(r"\s|O|I", x) is None
...
>>> (
...     df
...     # Mark the VIN groups
...     >> mutate(is_vin=cumsum(f.item.transform(isVIN)))
...     # Group by VINs
...     >> group_by(f.is_vin)
...     # Put the VINs and their values in new columns
...     >> mutate(vin=first(f.item), vin_value=first(f.value))
...     # Exclude VINs in the items
...     >> slice(~c(0))
...     # Get the values of A, B, C ...
...     >> pivot_wider([f.vin, f.vin_value], names_from=f.item, values_from=f.value)
...     # Select and rename columns
...     >> select(item=f.vin, value=f.vin_value, C=f.C)
... )
item value C
<object> <int64> <float64>
0 W0VEAZKXZMJ837930 0 32.0
1 W0VEAZKXZMJ857138 0 22.0
2 WAUZZZF23MN053792 0 NaN
3 WF0TK3SS2MMA50940 0 12.0
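uj5u.com user reply:
The other answers are all good. For a bit more variety, you could also filter df for the "long" rows and for the C values, concat the two, and then "compress" the DataFrame with groupby + first:
# sort_index() keeps the rows in their original order; depending on the pandas
# version, concat may not sort the combined index on its own
out = pd.concat([df[df['item'].str.len()==17],
                 df.loc[df['item']=='C', ['value']].set_axis(['C'], axis=1)], axis=1).sort_index()
out = out.groupby(out['item'].str.len().eq(17).cumsum()).first().reset_index(drop=True)
Output: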
item value C
0 WAUZZZF23MN053792 0.0 NaN
1 WF0TK3SS2MMA50940 0.0 12.0
2 W0VEAZKXZMJ857138 0.0 22.0
3 W0VEAZKXZMJ837930 0.0 32.0
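In the same spirit of variety, here is one more sketch that skips the reshaping altogether and looks the "C" value up per group with `Series.map`. This is a different technique from the answers above, not taken from any of them. It assumes that a "long" item is any 17-character string and that each group has at most one "C" row (otherwise the lookup index would not be unique and `map` would fail); the names `is_long`, `is_c`, `group` and `c_by_group` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["WAUZZZF23MN053792", "A",
             "WF0TK3SS2MMA50940", "A", "B", "C", "D", "E",
             "W0VEAZKXZMJ857138", "A", "B", "C", "D", "E",
             "W0VEAZKXZMJ837930", "A", "B", "C", "D", "E"],
    "value": [0, 1,
              0, 10, 11, 12, 13, 14,
              0, 20, 21, 22, 23, 24,
              0, 30, 31, 32, 33, 34],
})

is_long = df["item"].str.len().eq(17)  # assumption: long item = exactly 17 characters
is_c = df["item"].eq("C")
group = is_long.cumsum()               # one running id per long item and its followers

# Lookup table: group id -> value of that group's "C" row.
c_by_group = pd.Series(
    df.loc[is_c, "value"].to_numpy(),
    index=group[is_c].to_numpy(),
)

# Keep only the long rows and attach the looked-up "C" value (NaN if the group has none).
out = (
    df.loc[is_long]
    .assign(C=group[is_long].map(c_by_group))
    .reset_index(drop=True)
)
print(out)
```

Because the long rows are never merged with NaN-bearing rows, `value` keeps its integer dtype here; only `C` becomes float, to hold the NaN for the first long item.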
